1 PROBLEM DESCRIPTION AND RESULTS

1.1  Brief  Overview 

Under the aegis of the AFOSR grant we have been investigating computational and learning
attributes of networks of formal neurons. The formal neurons we consider are linear threshold elements
which produce binary outputs based on the sign of a linear form of a set of inputs. In particular,
each neuron is characterised by a vector of weights $w$, and given a set of inputs $u$, produces an
output $v = \operatorname{sgn}\langle w, u\rangle = \operatorname{sgn} \sum_j w_j u_j$.¹ In a given neural network architecture the degrees of
freedom reside in the specification of the neural weights; in particular, each choice of weights specifies
a particular computation. We have been interested in (1) exploring the theoretical limitations on
what can be computed or learnt in neural network architectures, and (2) developing and analysing
learning algorithms which specify weights as a function of a set of examples of a computation.
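
As a concrete illustration of the formal neuron just described, the following minimal sketch evaluates the sign of a weighted sum of the inputs. The function names and the convention that a zero sum maps to +1 are assumptions made for the example; the report does not fix a tie-breaking rule.

```python
def sgn(x):
    """Sign function; mapping 0 to +1 is an assumed tie-breaking convention."""
    return 1 if x >= 0 else -1


def linear_threshold(w, u):
    """Output v = sgn(sum_j w_j * u_j) of a formal neuron with weights w and inputs u."""
    if len(w) != len(u):
        raise ValueError("weights and inputs must have the same length")
    return sgn(sum(wj * uj for wj, uj in zip(w, u)))


# A three-input neuron with unit weights computes the majority of its inputs.
print(linear_threshold([1, 1, 1], [1, -1, 1]))    # prints 1
print(linear_threshold([1, 1, 1], [-1, -1, 1]))   # prints -1
```

Changing the weight vector changes the Boolean function computed, which is the sense in which the degrees of freedom reside in the weights.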

Since  information  about  any  computation  realisable  in  a  neural  network  resides  in  the  selection 
of  the  weights,  a  cogent  question  relevant  to  the  understanding  of  the  efficiency  of  this  computational 
structure  is:  How  much  information  can  be  stored  per  bit  of  weight?  A  satisfactory  resolution  of  this 
question  will  have  a  direct  import  on  the  dynamic  range  for  the  weights  that  will  be  demanded  of 
purveyors  of  neural  network  hardware.  Classical  learning  algorithms  such  as  Perceptron  Training  and 
Backpropagation  are  typically  operational  in  situations  where  there  is  no  dynamic  range  limitation 
on  the  weights;  if,  as  we  demonstrate,  dynamic  range  requirements  are  not  extreme,  then  another 
question arises: Do there exist efficient algorithms for learning weights in a dynamic range limited
network structure? In the analysis of these issues we present results for two distinct scenarios: in
one we consider networks with weights restricted to being binary (in a network of N neurons this
corresponds to spreading an available dynamic range of N bits uniformly across the neurons); in the
other we analyse a class of sparsely interconnected neural networks, a situation which corresponds
to packing available dynamic range in a few weights.

On another front, we have also investigated the enhancement in computational capability that
results if the degrees of freedom in specifying the weights are increased. In particular, we have
analysed a family of recurrent neural networks with the linear threshold neural model replaced by a
polynomial threshold model where each higher order neuron produces a binary output according to
the sign of a polynomial form of the inputs. The intuition here is that the extra degrees of freedom
in specifying the polynomial coefficients (weights) should result in more powerful computational
structures.

¹This is the basic neural model proposed by McCulloch and Pitts (1943). A real threshold can be incorporated in
the model, but is not essential to our discussion.

We have also been developing a theoretical basis in which the analysis of the above and similar
problems can be carried out. The framework that is evolving rests upon a statistical notion of
the computational capacity of a network architecture and an associated algorithm. The notion of
capacity is turning out to be fundamental in the analysis of intrinsic computational and learning
attributes of computational structures, and has been brought to prominence in the PAC learning
model of Valiant (1984).

In  the  following  we  briefly  summarise  the  results  we  have  obtained,  providing  a  road  map  as  it 
were  to  the  attached  papers  where  we  carry  out  more  searching  investigations  of  the  subject  matter. 
We  also  present  prognoses  and  brief  summaries  of  work  in  progress. 

1.2  Binary  Weights 

Consider a neuron with weights restricted to be binary, $w_i \in \{-1, 1\}$, $i = 1, \dots, n$. If we consider
inputs to be binary as well, each assignment of weights to the neuron results in the realisation of
one of $2^n$ distinct Boolean functions of $n$ Boolean variables, and in particular, one of $2^n$ majority
functions of a set of $n$ literals; for every $u \in \{-1, 1\}^n$, $f(u_1, \dots, u_n) = \operatorname{sgn} \sum_{j=1}^{n} w_j u_j$. An immediate
question is what is the computational capacity of such an element (vis-à-vis a neuron where the
weights are allowed to be real). Consider a randomly chosen m-set of points $u^1, \dots, u^m$ drawn from
$\{-1, 1\}^n$ with components drawn from a sequence of symmetric Bernoulli trials and an associated
set of desired binary ($\{-1, 1\}$) classifications $v^1, \dots, v^m$. We are interested in whether there exists
(with high probability) a vector of binary weights $w \in \{-1, 1\}^n$ such that each of the points is
classified correctly:

$$\operatorname{sgn} \sum_{i=1}^{n} w_i u_i^\alpha = v^\alpha, \qquad \alpha = 1, \dots, m. \qquad (1)$$

It is clear that for each $i = 1, \dots, n$, the binary weight $w_i \in \mathbb{B} = \{-1, 1\}$ has to retain information about
the corresponding $m$ input components $u_i^1, \dots, u_i^m$, and the desired classifications $v^1, \dots, v^m$. The
difficulty, of course, is that we have to store information about $m$ bits in a single bit, and, especially
if the learning procedure is on-line, it may perhaps appear doubtful if this is possible at all.
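
To make the existence question in (1) concrete, the sketch below searches all $2^n$ binary weight vectors for one that classifies every example correctly. It is an exhaustive, off-line search written only for illustration (the function names are hypothetical, and the tie convention sgn(0) = +1 is an assumption); its exponential cost is the reason the report turns to randomised procedures.

```python
from itertools import product


def sgn(x):
    """Sign function; mapping 0 to +1 is an assumed convention."""
    return 1 if x >= 0 else -1


def find_binary_weights(patterns, labels):
    """Return some w in {-1,1}^n satisfying (1) for every example, or None if none exists."""
    n = len(patterns[0])
    for w in product((-1, 1), repeat=n):
        if all(sgn(sum(wi * ui for wi, ui in zip(w, u))) == v
               for u, v in zip(patterns, labels)):
            return w
    return None


patterns = [(1, 1, -1), (-1, 1, 1), (1, -1, 1)]
labels = [1, 1, 1]
print(find_binary_weights(patterns, labels))        # prints (1, 1, 1)

# The same input cannot receive two different labels, so no weight vector exists.
print(find_binary_weights([(1, 1, 1), (1, 1, 1)], [1, -1]))   # prints None
```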

Let us consider without loss of generality that the $m$ desired classifications are all +1. The
assignments in (1) are hence more likely to be realised if, for each fixed $i$, $\min_\alpha \mathbf{E}\, w_i u_i^\alpha$ can be made
as large as possible; i.e., the probability that a weight has the same sign as a corresponding pattern
component is made as large as possible. Using randomisation in the algorithm as a tool we show that
for local on-line procedures $\sup \min_\alpha \mathbf{E}\, w_i u_i^\alpha = \Theta(1/n)$ for each $i$. Here the sup is taken over all local,
on-line procedures. For local off-line procedures the analogous result is $\sup \min_\alpha \mathbf{E}\, w_i u_i^\alpha = \Theta(1/\sqrt{n})$.
A detailed development of these and related results is contained in [1], which is included as the first
attachment. The basic conclusion that may be drawn from these investigations is that fairly large
capacities are attainable for networks of neurons with binary weights, and that these capacities are
comparable to those when the weights are unrestricted reals.

AFOSR 89-0523

While large capacities may be attainable in principle, there are practical difficulties in the design
of binary weight learning algorithms; learning binary weights is equivalent to integer programming
and is NP-complete. We might hence anticipate that for any binary weight learning algorithm there
exists at least one problem instance which is intractable (i.e., takes exponential time). We have
investigated the use of randomised algorithms, however, as a technique to partially circumvent the
NP-completeness of the problem by providing good average-case performance. In particular, we have
developed a family of local, on-line randomised algorithms (dubbed Directed Drift) which provide
good average-case performance in certain regimes. Details are provided in [2] which constitutes the
next attachment. R. Meir has communicated to us that simulations indicate that for local, on-line
algorithms Directed Drift appears to have an optimal character.
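
The precise definition of Directed Drift is given in [2] and is not reproduced in this summary. Purely to convey the flavour of a local, on-line randomised procedure for binary weights, the sketch below uses a hypothetical rule: when the presented example is misclassified, one weight that votes against the desired label is picked at random and flipped. The rule, the function name, and the stopping test are all assumptions for illustration and should not be read as the algorithm of [2].

```python
import random


def sgn(x):
    """Sign function; mapping 0 to +1 is an assumed convention."""
    return 1 if x >= 0 else -1


def random_flip_learning(patterns, labels, max_steps=10000, seed=0):
    """Hypothetical local, on-line randomised rule for learning binary weights.

    One example is presented per step. If it is misclassified, a weight whose
    contribution w_i * u_i disagrees with the desired label is chosen at random
    and its sign is flipped. Returns (weights, converged).
    """
    rng = random.Random(seed)
    n = len(patterns[0])
    w = [rng.choice((-1, 1)) for _ in range(n)]

    def consistent():
        # Global check used only to stop the sketch; it is not part of the local rule.
        return all(sgn(sum(wi * ui for wi, ui in zip(w, u))) == v
                   for u, v in zip(patterns, labels))

    for _ in range(max_steps):
        if consistent():
            return w, True
        idx = rng.randrange(len(patterns))
        u, v = patterns[idx], labels[idx]
        if sgn(sum(wi * ui for wi, ui in zip(w, u))) != v:
            # A misclassified example always has at least one disagreeing component.
            wrong = [i for i in range(n) if w[i] * u[i] != v]
            w[rng.choice(wrong)] *= -1
    return w, consistent()


weights, converged = random_flip_learning(
    [(1, 1, -1), (-1, 1, 1), (1, -1, 1)], [1, 1, 1])
print(weights, converged)
```

Each update touches a single weight and uses only the current example, which is what makes a rule of this kind attractive for low complexity hardware.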

Prospectus  We  have  been  investigating  further  randomisation  ideas  in  the  development  of  on-line 
and  off-line  learning  algorithms  and  these  are  to  be  reported  in  [3].  Combining  these  ideas  with  the 
Directed  Drift  family  of  algorithms  we  have  developed  heuristics  for  learning  binary  weights  for 
arbitrary  network  configurations.  Early  simulation  results  indicate  very  promising  performance  of 
these randomised algorithms. The implications for hardware development can be profound as these
early  results  indicate  that  it  suffices  to  have  very  small  dynamic  ranges  for  weights  (one  bit  suffices 
in  many  applications),  and  with  on-line  algorithms  such  as  those  described  here  this  can  lead  to  very 
low  complexity  neural  network  hardware  with  on-line  learning  capabilities. 

We  have  begun  to  study  batching  in  on-line  algorithms  as  a  tool  in  the  study  of  the  tradeoffs 
in  performance  and  complexity  when  we  move  from  on-line  to  off-line  procedures.  In  particular,  we 
have  been  examining  a  randomised  batch  version  of  Directed  Drift,  the  on-line  algorithm  described 
in  the  previous  report.  (Batch  learning  is  a  process  intermediate  in  complexity  between  on-line 
and  off-line  learning  where  learning  still  takes  place  in  a  sequence  of  trials,  but  a  (small)  batch  of 
examples  is  available  to  the  algorithm  at  each  epoch.)  Surprising  and  unlooked  for  results  have 
emerged  in  this  consideration  [4].  While  Directed  Drift  converges  very  rapidly  when  the  number  of 
examples  is  small,  it  slows  down  substantially  when  the  number  of  examples  becomes  large,  a  regime 
where,  effectively,  the  examples  are  numerous  enough  to  uniquely  identify  the  generating  function. 
In  this  latter  regime,  batch  versions  of  the  algorithm,  however,  show  improvements  in  convergence 
time  of  several  orders  of  magnitude,  even  for  very  small  batch  sizes.  Improvements  saturate  quickly 
with  increasing  batch  size  leading  to  the  conjecture  that  a  modified  on-line  learning  algorithm  with 
very  small  batch  sizes  can  achieve  off-line  performance.  These  and  other  results  are  to  be  reported 
at  the  Conference  on  Neural  Networks  for  Computing,  Snowbird,  1992. 

1.3  Sparse  Networks 

Sparsity in networks can arise either as a result of architectural constraints or as a consequence
of damage to the network. In either case sparsity can be viewed as a situation where a total
available  dynamic  range  in  bits  for  the  weights  is  distributed  among  a  few  weights  in  the  network. 

When sparsity occurs as a result of damage to the network, the principal concern is whether
the network continues to function effectively, i.e., whether the network is structurally robust. In
contexts with large interconnectivity, neural folklore tells us that networks will continue to function
efficiently, albeit with some degradation, in the presence of component damage or loss. We have
rigorously examined this premise for a fully-interconnected network of neurons in an associative
memory application by introducing the “devil”² in the network as an agent that produces sparsity by
snipping connections between neurons. A consideration of a malicious (or at best neutral) devil which
removes connections at random yields the following strong validation of the robustness hypothesis:
in a network of $n$ neurons each neuron need retain only of the order of $\log n$ random links (on
average) of a total of $n$ possible links with other neurons for useful computational properties to
emerge. Memory storage capacity degrades very gracefully as the probability of losing links increases.
Details are presented in [5, 6] which constitute the next two attachments.
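
The random "devil" can be mimicked in a few lines. The sketch below assumes an outer-product (Hebbian) storage rule and independent deletion of each directed link with probability 1 - keep_prob; both are assumptions made for illustration, since this summary does not specify the storage algorithm analysed in [5, 6]. It then checks whether a stored memory remains a fixed point of the damaged network.

```python
import random


def sgn(x):
    """Sign function; mapping 0 to +1 is an assumed convention."""
    return 1 if x >= 0 else -1


def sparse_hebbian_weights(memories, keep_prob, rng):
    """Outer-product weights in which each directed link survives with probability keep_prob."""
    n = len(memories[0])
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < keep_prob:
                w[i][j] = sum(mem[i] * mem[j] for mem in memories)
    return w


def is_fixed_point(w, state):
    """True if every neuron reproduces its own state after one synchronous update."""
    return all(sgn(sum(wij * s for wij, s in zip(row, state))) == si
               for row, si in zip(w, state))


rng = random.Random(0)
memory = [1, -1, 1, 1, -1, -1, 1, -1]
for keep_prob in (1.0, 0.5, 0.25):
    w = sparse_hebbian_weights([memory], keep_prob, rng)
    print(keep_prob, is_fixed_point(w, memory))
```

With a single stored memory, a neuron keeps its state as long as it retains at least one link, which gives a small-scale feel for why only a modest number of surviving links per neuron is needed.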

When network sparsity arises as a consequence of architectural constraints it may be possible to
demarcate classes of problems which are well suited to the sparse structure. We have investigated this
in a recurrent neural network situation where the neurons are partitioned into fully-interconnected
sub-blocks with few or no connections between blocks (nested sparsity and block sparsity,
respectively). For an associative memory application we identify memories to be stored as codewords and
the collection of admissible memories as codes; the neural network is a decoder which corrects
errors in memories. We show that for networks of $n$ neurons in nested or block architectures, storage
capacities as large as $2^{cn}$ memories for any $c < 1$ can be achieved for a family of codes (admissible
sets of memories) which is exponential in size. More precise statements of the results and details of
constructions and proofs can be found in the attachments [5, 7].

²Well, maybe an imp.

Prospectus  Sparse  network  structures  are  again  practically  motivated  as  hardware  would  appear 
to  favour  certain  regular,  sparse  interconnectivity  patterns.  The  characterisation  of  problems  best 
fitted  to  these  structures  is  still  an  open  question  which  the  above  investigations  answer  only  partially. 
In  an  effort  to  understand  how  computation  and  capacity  scale  with  increasing  sparsity  we  have 
done  extensive  simulations  in  a  feedforward  network  environment.  The  results  are  reported  in 
attachment  [20].  In  general,  capacity  decreases  with  increasing  sparsity  roughly  in  proportion  to  the 
loss  in  the  degrees  of  freedom.  This  is  reflected  by  a  concurrent  improvement  in  learning  times. 

1.4  Polynomial  Neural  Interactions 

Higher  order  neural  networks  have  been  proposed  in  the  literature  as  a  means  of  enhancing  the 
computational  capability  of  these  networks.  A  higher  order  neuron  is  a  polynomial  threshold  element 
which computes the sign of a polynomial form of its inputs. In particular, a higher order neuron of
degree $d$ and $n$ inputs is characterised by a set of $\binom{n}{d}$ weights $w_{i_1, \dots, i_d}$, $1 \le i_1 < \cdots < i_d \le n$. In
response to an input $u = (u_1, \dots, u_n)$ it produces an output

$$v = \operatorname{sgn} \sum_{1 \le i_1 < \cdots < i_d \le n} w_{i_1, \dots, i_d}\, u_{i_1} \cdots u_{i_d}.$$
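
A minimal sketch of such a polynomial threshold element follows. The dictionary representation of the weights (index tuples mapped to coefficients, with missing entries treated as zero) and the tie convention sgn(0) = +1 are assumptions made for the example.

```python
from itertools import combinations, product
from math import prod


def sgn(x):
    """Sign function; mapping 0 to +1 is an assumed convention."""
    return 1 if x >= 0 else -1


def higher_order_neuron(weights, u, d):
    """Degree-d polynomial threshold element.

    weights maps each index tuple (i_1 < ... < i_d), 0-based, to its coefficient;
    tuples that are absent are treated as having weight zero.
    """
    total = sum(weights.get(idx, 0) * prod(u[i] for i in idx)
                for idx in combinations(range(len(u)), d))
    return sgn(total)


# A single degree-2 weight computes the product u_1 * u_2 (parity of two inputs),
# a function that no linear threshold element of the same two inputs can realise.
for u in product((-1, 1), repeat=2):
    print(u, higher_order_neuron({(0, 1): 1}, u, d=2))
```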

The increased degrees of freedom give rise to a commensurate improvement in computational
capability. We have obtained rigorous results on the computational gains that accrue in recurrent
networks of higher order neurons. The main results can be summarised as follows: the information
storage capacity of a recurrent higher order neural network is of the order of one bit per polynomial
interaction coefficient (weight), this result being independent of the choice of the algorithm. We
provide exact results on associative storage capability and error correction for a variety of algorithms
in attachments [9, 10].

We have also carried out a complementary analysis of the structure of fixed points in symmetric
recurrent higher order networks when the weights are standard normal $\mathcal{N}(0,1)$ random
variables. This corresponds to higher order spin glasses in statistical physics. We obtain expressions
for the expected number of fixed points as a function of their margin of stability. In particular, we
show that there exists a critical margin of stability below which the expected number of fixed points
increases exponentially in $n$, and above which the expected number of fixed points actually decreases
exponentially with $n$. A formal statement of these results and proofs is provided in attachment [11].

On another tack, we have developed algorithms for recurrent neural networks for an associative
memory application. In particular, we have shown that it is possible to store memories with
simultaneous memory-specific as well as feature- (or direction-) specific error correction while retaining
high storage capacity [13, 14].

Prospectus  The  computational  gains  that  accrue  from  higher  order  neural  networks  agree  with 
intuition: they correspond with the increase in the degrees of freedom in the network. An attractive
feature of these networks is that each higher order neuron can be replaced by a functionally
equivalent  small  network  of  linear  threshold  elements  so  that  hardware  can  be  standardised  with  the 
formal  neuron  (linear  threshold  element)  as  the  basic  building  block.  The  low  complexity  algorithms 
described  in  [9,  10]  indicate  that  the  higher  capacity  latent  in  higher  order  networks  can  actually  be 
realised.  There  are  open  issues  on  the  nature  of  problems  that  efficiently  fit  networks  of  polynomial 
threshold  elements;  for  instance,  it  is  not  known  whether  the  family  of  poly-sized  two  layer  higher 
order  networks  is  functionally  strictly  subsumed  within  the  class  of  poly-sized  three  layer  higher 
order  networks.  We  are  investigating  these  issues. 

1.5 Capacity and Learning Sample Complexity

A common thread running through our analysis of the nature of information storage in the weights
of a neural network has been the notion of the statistical capacity of a network architecture and an
algorithm. This is a distribution dependent notion which captures, loosely speaking, the largest size
of a randomly specified set of input points which can be mapped into a corresponding independently
specified set of output points with high probability by the network specified by the algorithm. This
statistical  notion  of  capacity  plays  a  critical  role  in  determining  the  minimum  size  of  labeled  sample 
(the  sample  complexity)  needed  to  identify  a  given  function  realisable  in  a  given  network  architecture. 
The  statistical  notion  of  capacity  that  we  espouse  is  related  to  a  combinatorial  parameter  known  as 
the  VC-dimension  which  is  a  critical  parameter  in  a  distribution-free  learning  model.  We  develop  a 
fairly  general  set  of  definitions  of  capacity  in  the  attachment  [12]  and  have  presented  the  material 
in  [15].  Definitions  of  capacity  can  also  be  found  in  the  earlier  reported  work,  and  in  particular, 
in  [5,  6,  7,  9,  10]. 

One  particular  problem  we  have  considered  is  the  gains  that  may  be  realisable  in  computational 
capacity  if  errors  are  permitted  in  the  output.  Our  main  results  here  are  that  allowing  a  linear 
number  of  output  errors  improves  the  constants,  but  not  the  rate  of  growth  of  capacity.  The  exact 
constants  depend  upon  what  protocol  governs  the  errors.  Exact  expressions  and  derivations  are 
provided  in  attachment  [16,  17]. 

As  aforementioned,  capacities  govern  the  sample  complexities  needed  for  learning.  A  related 
issue  of  both  practical  and  theoretical  interest  is  the  rate  of  convergence  that  can  be  expected  of 
a  learning  algorithm  as  a  function  of  the  sample  size.  In  an  effort  to  shed  light  on  what  may  be 
the  worst  case  behaviour  we  considered  the  classical  nearest  neighbour  algorithm  which  has  been 
suggested  to  be  representative  of  the  best  non-parametric  learning  algorithms.  (The  specification 
of  a  host  neural  network  architecture  for  a  learning  algorithm  in  sharp  contradistinction  imposes  a 
parametrisation. The learning algorithm seeks to find weights (the parameters) for the architecture
which  best  fit  the  data.  If  the  parametrisation,  i.e.,  the  choice  of  architecture,  is  appropriate  this 
should  result  in  substantially  better  behaviour  than  a  non-parametrised  approach.)  It  is  known 
that  in  the  infinite  sample  limit  the  nearest  neighbour  algorithm  has  performance  no  worse  than 
twice  the  Bayes  risk,  and  an  old  result  of  T.  M.  Cover  shows  that  for  one-dimensional  feature  spaces 
the convergence rate to the infinite sample limit is as rapid as $O(m^{-2})$ where $m$ is the sample
size. In attachments [18, 19] we present a precise statement of a generalisation of Cover’s result
to n-dimensional feature spaces which has been hitherto lacking. We show that the performance
of the nearest neighbour algorithm converges to its infinite sample limit as rapidly as $O(m^{-2/n})$,
where  n  is  the  dimensionality  of  the  space  and  m  is  the  sample  size.  This  result  holds  under  mild 
conditions  on  the  input  distribution.  (Alternatively,  the  sample  complexities  needed  for  learning 
are  exponential  in  n.)  Clearly  Bellman’s  “curse  of  dimensionality”  is  made  evident  in  the  drastic 
reduction  in  convergence  rates  as  the  input  dimensionality  increases. 

On  another  tack,  we  have  been  extending  these  results  to  precisely  estimate  the  value  of  side- 
information in learning, with special reference to a problem proposed by T. Cover: How many
unlabelled  examples  is  each  labelled  example  worth  in  learning?  The  question  has  import  when 
unlabelled  examples  exist  in  relative  profusion,  but  there  are  few  labelled  examples  or  where  labelling 
examples  is  expensive.  The  answer  in  general  depends  on  how  much  side-information  is  present.  We 
have  early  results  and  are  investigating  further  [20]. 

Prospectus We are seeking to combine the various elements described above into a theory of
evolutionary learning in a neural network setting. A result of J. S. Judd indicates that the
problem of deciding whether a given problem instance can be loaded into a given architecture may be
intractable (read NP-complete). It may be possible to circumvent this problem by adopting a
suitable evolutionary protocol where the network architecture is allowed to grow in time. To keep the
complexity manageable we have been considering binary weight networks, using the randomised
algorithms we have been developing for training at every stage of network evolution. Training periods
at each stage of the evolution are governed by the statistical capacity of the network at that stage in
the evolution. Results are only preliminary at this stage. The convergence rate calculations for the
nearest neighbour algorithm indicate the importance of choosing a proper evolutionary protocol. A
very loose evolutionary structure would be essentially unparametrised leading to very large times
for convergence.


1.6  Coin  Tossing  and  Randomised  Algorithms 

The last attachment [21] is related only indirectly to issues in neural network computation, in
that it examines limitations of randomised procedures. The basic question analysed is the
expected  minimum  duration  of  a  coin  tossing  game  which  carries  the  essence  of  several  randomised 
search  procedures  in  cryptography.  The  paper  provides  precise  and  rather  complete  estimates  of  the 
minimum  duration  of  the  game  and  provides  constructions  for  generating  the  optimal  strategy. 

Prospectus We are seeking efficient learning algorithms to learn discrete weights for neural
networks. Here randomisation is turning out to be a very effective tool to attack some formally
intractable learning problems. We are investigating, among other issues, a characterisation of the
effectiveness of randomisation.

How Much Information Can One Bit of Memory Retain About a Bernoulli Sequence?

Abstract: The maximin problem of the maximization of the
minimum amount of information that a single bit of memory
retains about the entire past is investigated. Specifically, a
random binary sequence of ±1 inputs drawn from a sequence of
symmetric Bernoulli trials is given. A family of (time dependent,
deterministic or probabilistic) memory update rules that at each
epoch produce a new bit (-1 or 1) of memory depending solely
on the epoch, the current input, and the current state of memory
is also given. The problem is to estimate the supremum over all
possible sequences of update rules of the minimum information
that the bit of memory at epoch $(n + 1)$ retains about the
previous $n$ inputs. Using only elementary techniques we show
that the maximin covariance between the memory at epoch
$(n + 1)$ and past inputs is $\Theta(1/n)$, the maximum average
covariance is $\Theta(1/n)$, and the maximin mutual information is $\Omega(1/n^2)$.
In a consideration of related issues, we also provide an exact
count of the number of Boolean functions of $n$ variables that
can be obtained recursively from Boolean functions of two
variables, discuss extensions and applications of the original
problem, and indicate links with issues in neural computation.

Index Terms: Bernoulli sequence, Boolean functions, memory,
covariance, mutual information, neuron, capacity.

I.  A  Problem  in  Information  Storage 

JANOS KOMLOS posed the following problem: Given
a  single  bit  of  memory  and  a  random  binary  sequence 
of  inputs,  at  any  epoch  in  time  what  is  the  maximum 
amount  of  information  that  the  memory  can  retain  about 
the  entire  binary  sequence? 

More precisely, let $\{X_n\}_{n=1}^{\infty}$ be a sequence of symmetric
Bernoulli trials, with

$$X_n = \begin{cases} -1 & \text{with probability } 1/2 \\ \phantom{-}1 & \text{with probability } 1/2. \end{cases}$$

Let $M_n \in \{-1, 1\}$ denote the state of a one bit memory at
epoch $n$. The memory states are updated by a sequence
of (possibly random) Boolean functions, $f_n$, of two Boolean
variables: $M_{n+1} = f_n(M_n, X_n)$. (The initial memory state,
$M_1$, is arbitrary.) For each $n$ we are required to estimate

$$J_n = \max_{f_1, \dots, f_n} \min_{1 \le k \le n} \mathbf{E}(M_{n+1} X_k). \qquad (1)$$

Is $J_n$ bounded away from zero? Can we identify functions
$f_1^*, \dots, f_n^*$ that achieve $J_n$?

Komlos’ problem can be generalized in various ways
with other measures of information used instead of the
covariance. Specifically, we can consider the determination of

$$\max_{f_1, \dots, f_n} \min_{1 \le k \le n} I(M_{n+1}; X_k), \qquad (2)$$

where $I(M_{n+1}; X_k)$ denotes the mutual information of
$M_{n+1}$ and $X_k$. Another measure of (average) information
about the past that we investigate is

$$\max_{f_1, \dots, f_n} \frac{1}{n} \sum_{k=1}^{n} \mathbf{E}(M_{n+1} X_k). \qquad (3)$$

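The following are the main results:¹ the maximin covariance (1) is $\Theta(1/n)$; the maximin
mutual information (2) is $\Omega(1/n^2)$; and the maximum average covariance (3) is $\Theta(1/n)$.
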
¹On Notation. If $\{x_n\}$ and $\{y_n\}$ are positive sequences, we denote:
$x_n = O(y_n)$ if there is a positive constant $K$ such that $x_n/y_n < K$ for all
$n$; $x_n = \Omega(y_n)$ if there is a positive constant $L$ such that $x_n/y_n > L$ for
all $n$; $x_n = \Theta(y_n)$ if $x_n = O(y_n)$ and $x_n = \Omega(y_n)$; and $x_n \sim y_n$ if
$x_n/y_n \to 1$ as $n \to \infty$.

The last result is due to Komlos, Rejto, and Tusnady [1]
who have recently investigated the average covariance,
$K_n$, in a control problem. In this paper, we show that the
result holds as a direct consequence of arguments
adduced in the consideration of the maximin problem $J_n$.
We also show that the maximum average covariance is
$\Theta(1/\sqrt{n})$ when we allow update rules with unlimited
access to past inputs. Specifically, let $\mathcal{F}_n$ denote the family
of all update rules mapping $\{-1, 1\}^n$ into $\{-1, 1\}$. Then

$$\max_{f \in \mathcal{F}_n} \frac{1}{n} \sum_{k=1}^{n} \mathbf{E}\bigl(f(X_1, \dots, X_n)\, X_k\bigr) \sim \sqrt{\frac{2}{\pi n}}.$$
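
The unlimited-access rate can be checked numerically. For a rule $f$ that sees all $n$ inputs, the average covariance equals $\frac{1}{n}\mathbf{E}\bigl(f \cdot (X_1 + \cdots + X_n)\bigr)$, which is at most $\frac{1}{n}\mathbf{E}|X_1 + \cdots + X_n|$, with equality for the majority rule $f = \operatorname{sgn}(X_1 + \cdots + X_n)$. The sketch below (the function name is hypothetical) evaluates this quantity exactly from the binomial distribution and compares it with $\sqrt{2/(\pi n)}$.

```python
from math import comb, pi, sqrt


def average_covariance_of_majority(n):
    """Exact (1/n) * E|X_1 + ... + X_n| for n independent symmetric +/-1 inputs."""
    # With j inputs equal to -1 the sum is n - 2j; this happens with probability C(n, j) / 2**n.
    expected_abs_sum = sum(comb(n, j) * abs(n - 2 * j) for j in range(n + 1)) / 2**n
    return expected_abs_sum / n


for n in (1, 11, 101, 1001):
    exact = average_covariance_of_majority(n)
    print(n, exact, exact / sqrt(2 / (pi * n)))
```

The ratio in the last column tends to one as $n$ grows, in agreement with the asymptotic statement.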

In the proof of the results, it also develops that the
maximin and average absolute values of covariances are also
$\Theta(1/n)$, with

$$\frac{1}{n} \le \max_{f_1, \dots, f_n} \min_{1 \le k \le n} \bigl|\mathbf{E}(M_{n+1} X_k)\bigr| \le \frac{2}{n},$$

and

$$\frac{1}{n} \le \max_{f_1, \dots, f_n} \frac{1}{n} \sum_{k=1}^{n} \bigl|\mathbf{E}(M_{n+1} X_k)\bigr| \le \frac{2}{n}.$$

If we restrict attention to a reasonable family of
update rules (monotone symmetric rules) we demonstrate,
in fact, that $\max \min \mathbf{E}(M_{n+1} X_k) = 1/n$ and
$\max \min I(M_{n+1}; X_k) \sim 1/(2 n^2 \ln 2)$.²
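
The link between the covariance value $1/n$ and the mutual information value $1/(2n^2 \ln 2)$ can be made explicit with a short calculation. The sketch assumes that $M_{n+1}$ and $X_k$ are both symmetric $\pm 1$ variables (as holds when the initial memory is a fair coin and the rules are symmetric) and writes $\rho$ for $\mathbf{E}(M_{n+1} X_k)$; mutual information is measured in bits.

```latex
% Joint law of two symmetric +/-1 variables with covariance \rho:
%   P{M_{n+1} = a, X_k = b} = (1 + ab\rho)/4,  a, b \in \{-1, 1\}.
\begin{align*}
I(M_{n+1}; X_k)
  &= \sum_{a, b \in \{-1, 1\}} \frac{1 + ab\rho}{4}
     \log_2 \frac{(1 + ab\rho)/4}{1/4} \\
  &= \frac{1 + \rho}{2} \log_2 (1 + \rho) + \frac{1 - \rho}{2} \log_2 (1 - \rho) \\
  &= \frac{\rho^2}{2 \ln 2} + O(\rho^4) \qquad (\rho \to 0).
\end{align*}
% Substituting \rho = 1/n gives I(M_{n+1}; X_k) \sim 1/(2 n^2 \ln 2).
```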

In Section III, we will conclude by looking briefly at
some related issues. In particular, we will: provide an
exact count of the number of Boolean functions of $n$
variables that can be obtained by a recursive application
of $(n - 1)$ Boolean functions of two variables, with the
variables taken in sequence (there are exactly $(0.4)6^n + 1.6$
such Boolean functions); examine extensions of the
results and raise some open questions when more than
one bit of memory is available; and link these results with
issues in information storage in neural networks.
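
The count can be confirmed by brute force for small $n$. The sketch below enumerates every function of the form $f_{n-1}(\cdots f_2(f_1(x_1, x_2), x_3) \cdots, x_n)$, representing each function by its truth table, and compares the number of distinct tables with $(0.4)6^n + 1.6 = (2 \cdot 6^n + 8)/5$. The function name and the 0/1 encoding of the Boolean values are choices made for the example; the check is run for $n \ge 2$, where at least one two-variable function is applied.

```python
from itertools import product


def count_recursive_functions(n):
    """Number of distinct Boolean functions f_{n-1}(...f_1(x_1, x_2)..., x_n) of n variables."""
    inputs = list(product((0, 1), repeat=n))
    # Start from the truth table of the projection onto the first variable.
    current = {tuple(x[0] for x in inputs)}
    # All 16 two-variable functions, each stored as a table indexed by 2*y + x.
    two_variable = list(product((0, 1), repeat=4))
    for k in range(1, n):
        nxt = set()
        for table in current:
            for f in two_variable:
                nxt.add(tuple(f[2 * y + x[k]] for y, x in zip(table, inputs)))
        current = nxt
    return len(current)


for n in range(2, 6):
    print(n, count_recursive_functions(n), (2 * 6**n + 8) // 5)
```

For $n = 2$ the count is 16, i.e. every Boolean function of two variables, and each further variable multiplies the count by roughly six.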

II.  Information  Bounds 
A.  Probabilistic  Rules 

In the most general setting the update rules, $M_{k+1} = f_k(M_k, X_k)$,
are probabilistic and can be characterized in
terms of probabilities conditioned upon the epoch, $k$, the
current state of memory, $M_k$, and the current input, $X_k$,
as follows: if $M_k = i \in \{-1, 1\}$ and $X_k = j \in \{-1, 1\}$, then
set

$$M_{k+1} = \begin{cases} -M_k & \text{with probability } p_k(i, j), \\ \phantom{-}M_k & \text{with probability } q_k(i, j) = 1 - p_k(i, j). \end{cases}$$

Alternatively,

$$p_k(i, j) = \mathbf{P}\{M_{k+1} = -i \mid M_k = i, X_k = j\}.$$

Each update rule, $f_k$, can hence be defined by four
(independently specifiable) probabilities, $p_k(-1, -1)$,
$p_k(-1, 1)$, $p_k(1, -1)$, and $p_k(1, 1)$, each of which
represents the probability, given the epoch, and current values
of memory and input, that the memory update results in a
change of sign of memory.

²We conjecture, in fact, that $\max \min \mathbf{E}(M_{n+1} X_k) = 1/n$ and
$\max \min I(M_{n+1}; X_k) \sim 1/(2 n^2 \ln 2)$, with the maximum being taken over
all functions $f_1, \dots, f_n$. This is not true for absolute values of
covariances, however. J. Komlos has recently communicated a construction to
us that demonstrates $\max \min |\mathbf{E}(M_{n+1} X_k)| > 1/n$.

We define the family of monotone symmetric update
rules to be update rules satisfying: $p_j(-1, -1) = p_j(1, 1) = 0$,
and $p_j(-1, 1) = p_j(1, -1)$, $j \ge 1$. The first of the two
symmetry requirements, in particular, enforces no change
in memory state if the current input agrees with the
current state of memory, an intuitively appealing
procedure.

We first evaluate the unconditional probabilities

\[ \omega_k = P\{M_k = 1\}, \qquad \bar{\omega}_k = P\{M_k = -1\}. \]

(Clearly, $\bar{\omega}_k = 1 - \omega_k$; we introduce the additional notation for later convenience.) Let us assume, without loss of generality, that we generate the initial value of the memory, $M_1$, by flipping a fair coin.² Hence, $\omega_1 = \bar{\omega}_1 = 1/2$. For $j \ge 1$, define

\[ \psi_j = p_j(-1,-1) + p_j(-1,1) + p_j(1,-1) + p_j(1,1). \tag{4} \]

For convenience, let us also define

\[ p_0(-1,-1) = p_0(-1,1) = p_0(1,-1) = p_0(1,1) = 1/2. \]

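Assertion 1: For $k = 0, 1, \ldots$, the unconditional probabilities for the state of the memory at epoch $k + 1$ are given by

\[ \omega_{k+1} = \sum_{i=0}^{k} \frac{1}{2}\bigl[p_i(-1,-1) + p_i(-1,1)\bigr] \prod_{j=i+1}^{k}\left(1 - \frac{\psi_j}{2}\right), \tag{5} \]

\[ \bar{\omega}_{k+1} = \sum_{i=0}^{k} \frac{1}{2}\bigl[p_i(1,1) + p_i(1,-1)\bigr] \prod_{j=i+1}^{k}\left(1 - \frac{\psi_j}{2}\right). \tag{6} \]

Proof: We can obtain the following recursion by noting that $\bar{\omega}_k = 1 - \omega_k$:

\[ \omega_{k+1} = \omega_k + \frac{\bar{\omega}_k}{2}\bigl[p_k(-1,-1) + p_k(-1,1)\bigr] - \frac{\omega_k}{2}\bigl[p_k(1,1) + p_k(1,-1)\bigr], \]

\[ \bar{\omega}_{k+1} = \bar{\omega}_k - \frac{\bar{\omega}_k}{2}\bigl[p_k(-1,-1) + p_k(-1,1)\bigr] + \frac{\omega_k}{2}\bigl[p_k(1,1) + p_k(1,-1)\bigr]. \]

The result can now be established by induction. □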
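As an informal check of Assertion 1 (not part of the original analysis), the following Python sketch computes $\omega_k$ both from the one-step recursion in the proof and from the closed form (5), using exact rational arithmetic. The helper names `make_rule`, `omega_by_recursion`, and `omega_by_closed_form` are hypothetical, chosen only for this illustration; the example rules are an arbitrary deterministic choice.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def make_rule(p_mm, p_mp, p_pm, p_pp):
    """Flip probabilities p(i, j), keyed by (memory i, input j)."""
    return {(-1, -1): Fraction(p_mm), (-1, 1): Fraction(p_mp),
            (1, -1): Fraction(p_pm), (1, 1): Fraction(p_pp)}

def omega_by_recursion(rules):
    """Return [omega_1, ..., omega_{n+1}], where omega_k = P{M_k = 1}."""
    omega = [HALF]                                   # M_1 is a fair coin flip
    for p in rules:                                  # rules[k-1] holds p_k
        w = omega[-1]
        flip_up = HALF * (p[(-1, -1)] + p[(-1, 1)])  # P{flip | M_k = -1}
        flip_down = HALF * (p[(1, 1)] + p[(1, -1)])  # P{flip | M_k = +1}
        omega.append(w + (1 - w) * flip_up - w * flip_down)
    return omega

def omega_by_closed_form(rules):
    """Same list computed from (5), using the convention p_0(i, j) = 1/2."""
    all_rules = [make_rule(HALF, HALF, HALF, HALF)] + list(rules)
    psi = [sum(p.values()) for p in all_rules]       # psi_j as in (4)
    result = []
    for k in range(len(all_rules)):                  # this pass gives omega_{k+1}
        total = Fraction(0)
        for i in range(k + 1):
            term = HALF * (all_rules[i][(-1, -1)] + all_rules[i][(-1, 1)])
            for j in range(i + 1, k + 1):
                term *= 1 - psi[j] / 2
            total += term
        result.append(total)
    return result

# Example: the first rule copies the input; later rules reset on an input of -1.
rules = [make_rule(0, 1, 1, 0)] + [make_rule(0, 0, 1, 0) for _ in range(4)]
print(omega_by_recursion(rules) == omega_by_closed_form(rules))
```

Exact fractions are used so that the two computations can be compared with `==` rather than with a floating-point tolerance.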

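For $k \ge 1$, let us now define

\[ \phi_k = \bigl[\bar{\omega}_k\, p_k(-1,1) + \omega_k\, p_k(1,-1)\bigr] - \bigl[\bar{\omega}_k\, p_k(-1,-1) + \omega_k\, p_k(1,1)\bigr]. \tag{7} \]

Assertion 2: For any choices of $n$ and $k$ with $k \le n$,

\[ P\{M_{n+1} = X_k\} = \frac{1}{2}\left[1 + \phi_k \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right)\right], \tag{8} \]

\[ P\{M_{n+1} = -X_k\} = \frac{1}{2}\left[1 - \phi_k \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right)\right]. \tag{9} \]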

Remark: We adopt the convention $\prod_{i=r}^{j}(\cdot) = 1$ if $r > j$.

² The initial choice of memory bit can have no information about the data sequence to come. The obvious optimal procedure would be to choose the update rule .

Proof: To prove the assertion we use double induction on $k$ and $n$.

Base: For every choice of $n \ge 1$ and $k = n$, we have

\[ P\{M_{n+1} = X_n\} = \frac{\bar{\omega}_n}{2}\bigl[1 - p_n(-1,-1) + p_n(-1,1)\bigr] + \frac{\omega_n}{2}\bigl[1 - p_n(1,1) + p_n(1,-1)\bigr] = \frac{1}{2}(1 + \phi_n). \]

Inductive Hypothesis: Assume that for some choice of $n$ and $k < n$, we have

\[ P\{M_n = X_k\} = \frac{1}{2}\left[1 + \phi_k \prod_{j=k+1}^{n-1}\left(1 - \frac{\psi_j}{2}\right)\right]. \]

Now consider

\[ P\{M_{n+1} = X_k = 1\} = P\{M_{n+1} = 1, M_n = 1, X_k = 1\} + P\{M_{n+1} = 1, M_n = -1, X_k = 1\} \]
\[ = P\{M_{n+1} = 1 \mid M_n = 1, X_k = 1\}\, P\{M_n = 1, X_k = 1\} + P\{M_{n+1} = 1 \mid M_n = -1, X_k = 1\}\, P\{M_n = -1, X_k = 1\}. \tag{10} \]

Now, given $M_n$, the random variable $M_{n+1}$ is conditionally independent of the random variable $X_k$. Hence,

\[ P\{M_{n+1} = 1 \mid M_n = 1, X_k = 1\} = \frac{1}{2}\bigl[1 - p_n(1,-1) + 1 - p_n(1,1)\bigr]. \tag{11} \]

In similar fashion, we obtain

\[ P\{M_{n+1} = 1 \mid M_n = -1, X_k = 1\} = \frac{1}{2}\bigl[p_n(-1,-1) + p_n(-1,1)\bigr]. \tag{12} \]

We now claim that

\[ P\{M_n = -1, X_k = 1\} = \frac{1}{2} - P\{M_n = X_k = 1\}. \tag{13} \]

In fact, we have

\[ P\{M_n = -1, X_k = 1\} = P\{M_n = -1 \mid X_k = 1\}\, P\{X_k = 1\} = \frac{1}{2}\bigl(1 - P\{M_n = 1 \mid X_k = 1\}\bigr) = \frac{1}{2}\left(1 - \frac{P\{M_n = 1, X_k = 1\}}{P\{X_k = 1\}}\right), \]

so that (13) follows. Substituting the results of (11)-(13) in (10), we obtain

\[ P\{M_{n+1} = X_k = 1\} = \frac{1}{4}\bigl[p_n(-1,-1) + p_n(-1,1)\bigr] + \left(1 - \frac{\psi_n}{2}\right) P\{M_n = X_k = 1\}. \]

An entirely analogous procedure yields

\[ P\{M_{n+1} = X_k = -1\} = \frac{1}{4}\bigl[p_n(1,1) + p_n(1,-1)\bigr] + \left(1 - \frac{\psi_n}{2}\right) P\{M_n = X_k = -1\}. \]

Combining the two results gives

\[ P\{M_{n+1} = X_k\} = P\{M_{n+1} = X_k = 1\} + P\{M_{n+1} = X_k = -1\} = \frac{\psi_n}{4} + \left(1 - \frac{\psi_n}{2}\right) P\{M_n = X_k\} = \frac{1}{2}\left[1 + \phi_k \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right)\right]. \]

The base of the induction argument establishes the first part of the assertion, (8), for $k = n$, and the inductive hypothesis completes the induction for $k < n$. Equation (9) follows trivially from the observation that $P\{M_{n+1} = -X_k\} = 1 - P\{M_{n+1} = X_k\}$. □

Assertion 3: For $n \ge 1$ and $1 \le k \le n$,

\[ P\{M_{n+1} = X_k = 1\} = \frac{\omega_{n+1}}{2} + \frac{\phi_k}{4} \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right), \]

\[ P\{M_{n+1} = X_k = -1\} = \frac{\bar{\omega}_{n+1}}{2} + \frac{\phi_k}{4} \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right), \]

\[ P\{M_{n+1} = -1, X_k = 1\} = \frac{\bar{\omega}_{n+1}}{2} - \frac{\phi_k}{4} \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right), \]

\[ P\{M_{n+1} = 1, X_k = -1\} = \frac{\omega_{n+1}}{2} - \frac{\phi_k}{4} \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right). \]

Proof: These results can be verified, as in Assertion 2, by induction. □

Remark: The previous identities simplify considerably for the family of monotone symmetric update rules; in particular, update rules governed by probabilities of the form $p_j(-1,-1) = p_j(1,1) = 0$, and $p_j(-1,1) = p_j(1,-1) = p_j$, $j \ge 1$. Substituting in (4)-(7) we have $\psi_k = 2p_k$, $\omega_k = \bar{\omega}_k = 1/2$, and $\phi_k = p_k$, for $k \ge 1$. Substituting these relations in the above expressions, we have

\[ P\{M_{n+1} = X_k = 1\} = P\{M_{n+1} = X_k = -1\} = \frac{1}{4}\left[1 + p_k \prod_{j=k+1}^{n}(1 - p_j)\right], \tag{14} \]

\[ P\{M_{n+1} = 1, X_k = -1\} = P\{M_{n+1} = -1, X_k = 1\} = \frac{1}{4}\left[1 - p_k \prod_{j=k+1}^{n}(1 - p_j)\right]. \tag{15} \]

              X_1 = -1    X_1 = 1
  M_1 = -1       -1          1
  M_1 =  1       -1          1

Fig. 1. Odd parity update for $M_2$.


B.  Maximin  Covariance 

A direct application of (8) and (9) yields the following general result.

Assertion 4: For any choice of positive integers $n$ and $k$ with $1 \le k \le n$,

\[ E(M_{n+1}X_k) = \phi_k \prod_{j=k+1}^{n}\left(1 - \frac{\psi_j}{2}\right), \tag{16} \]

where $\psi_j$ and $\phi_k$ are given by (4) and (7), respectively.

Some examples may serve to fix the result.

Example (Follow the Leader): Consider the choice of rule $M_{j+1} = X_j$, $j \ge 1$, corresponding to the selection

\[ p_j(-1,-1) = p_j(1,1) = 0, \qquad p_j(-1,1) = p_j(1,-1) = 1. \]

From the defining equation (4), we clearly have $\psi_j = 2$ for every $j \ge 1$. Applying (5) and (6), we have the unconditional probabilities of the state of the memory given by $\omega_k = \bar{\omega}_k = 1/2$, so that applying (7) we have $\phi_k = 1$. Hence,

\[ E(M_{n+1}X_k) = \begin{cases} 1 & \text{if } k = n, \\ 0 & \text{if } k < n, \end{cases} \]

in agreement with the intuitive result. Consequently, for $n \ge 2$, $\min_{k \le n} E(M_{n+1}X_k) = 0$.

Example (Parity): Consider the sequence of update rules which, at any epoch $n$, set $M_{n+1} = 1$ iff an odd number of the random variables, $X_1, \ldots, X_n$, have taken on the value 1. The update rules determining $M_2$ and $M_{k+1}$, $k \ge 2$ are shown in Figs. 1 and 2.

              X_k = -1    X_k = 1
  M_k = -1       -1          1
  M_k =  1        1         -1

Fig. 2. Odd parity update for $M_{k+1}$, $k \ge 2$.

The probabilities corresponding to the update rules are, hence,

\[ p_k(-1,-1) = p_k(1,1) = 0, \qquad p_k(-1,1) = p_k(1,-1) = 1, \]

when $k = 1$, and

\[ p_k(-1,-1) = p_k(1,-1) = 0, \qquad p_k(-1,1) = p_k(1,1) = 1, \qquad k \ge 2. \]

Evaluating the various parameters we obtain

\[ \psi_j = 2, \quad j \ge 1; \qquad \omega_k = 1/2, \quad \bar{\omega}_k = 1/2, \quad k \ge 1; \qquad \phi_k = \begin{cases} 1 & \text{if } k = 1, \\ 0 & \text{if } k \ge 2. \end{cases} \]

Substituting these into (16) yields, for $n \ge 2$,

\[ E(M_{n+1}X_k) = 0, \qquad k = 1, \ldots, n. \]

For $n \ge 2$, this again yields $\min_{k \le n} E(M_{n+1}X_k) = 0$.
These examples illustrate that it suffices, hence, to restrict attention to update rules that yield nonnegative covariances, $E(M_{n+1}X_k)$, for every $k \le n$. The following example illustrates that a nonzero covariance can, in fact, be obtained between a memory and every past input using a purely deterministic sequence of update rules.

Example (Unbroken Runs): Consider the sequence of update rules which store a 1 in the memory iff there has been an unbroken run of inputs taking the value 1. The update rules determining $M_2$ and $M_{k+1}$, $k \ge 2$ are shown in Figs. 3 and 4.

              X_1 = -1    X_1 = 1
  M_1 = -1       -1          1
  M_1 =  1       -1          1

Fig. 3. Unbroken run update for $M_2$.

              X_k = -1    X_k = 1
  M_k = -1       -1         -1
  M_k =  1       -1          1

Fig. 4. Unbroken run update for $M_{k+1}$, $k \ge 2$.

The probabilities corresponding to the update rules are, hence,

\[ p_k(-1,-1) = p_k(1,1) = 0, \qquad p_k(-1,1) = p_k(1,-1) = 1, \]

when $k = 1$, and

\[ p_k(-1,-1) = p_k(-1,1) = p_k(1,1) = 0, \qquad p_k(1,-1) = 1, \qquad k \ge 2. \]

Evaluating the various parameters we obtain

\[ \psi_j = \begin{cases} 2 & \text{if } j = 1, \\ 1 & \text{if } j \ge 2, \end{cases} \qquad \omega_k = \begin{cases} 1/2 & \text{if } k = 1, \\ 2^{-k+1} & \text{if } k \ge 2, \end{cases} \]

\[ \bar{\omega}_k = \begin{cases} 1/2 & \text{if } k = 1, \\ 1 - 2^{-k+1} & \text{if } k \ge 2, \end{cases} \qquad \phi_k = \begin{cases} 1 & \text{if } k = 1, \\ 2^{-k+1} & \text{if } k \ge 2. \end{cases} \]

Substituting these into (16) yields

\[ E(M_{n+1}X_k) = \begin{cases} 2^{-n+1} & \text{if } k = 1, \\ 2^{-n+1} & \text{if } k \ge 2. \end{cases} \]

Hence, $\min_{k \le n} E(M_{n+1}X_k) = 2^{-n+1}$ for $n \ge 1$.

While the minimum covariance in the above example is nonzero, it is still exponentially small. To obtain somewhat larger minimum covariances we resort to probabilistic update rules.

Example (Harmonic Updates): For each $k \ge 1$ we prescribe the update rule by setting $p_k(-1,-1) = p_k(1,1) = 0$ and $p_k(-1,1) = p_k(1,-1) = 1/k$. This is equivalent to the following prescription:

1) if $X_k = M_k$, then set $M_{k+1} = M_k$;

2) if $X_k \ne M_k$, then set

\[ M_{k+1} = \begin{cases} -M_k & \text{with probability } 1/k, \\ M_k & \text{with probability } 1 - 1/k; \end{cases} \]

specifically, we do not change the current state of the memory if the current input matches the sign of the memory, and change the state of the memory probabilistically (but with increasing reluctance) in case of a mismatch in signs. Estimating the various parameters gives

\[ \psi_j = \frac{2}{j}, \quad j \ge 1; \qquad \omega_k = \frac{1}{2}, \quad \bar{\omega}_k = \frac{1}{2}, \quad \phi_k = \frac{1}{k}, \quad k \ge 1. \]

Substituting in (16) yields $E(M_{n+1}X_k) = 1/n$ for $k \le n$. It, hence, follows that, in fact, $\min_{k \le n} E(M_{n+1}X_k) = 1/n$.
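The four examples can be reproduced numerically from (4), (7), and (16). The following Python sketch is an illustration only, with hypothetical helper names (`make_rule`, `covariances`); it evaluates $E(M_{n+1}X_k)$ in exact rational arithmetic for a given sequence of update rules, tracking $\omega_k$ with the one-step recursion from the proof of Assertion 1.

```python
from fractions import Fraction as F

def make_rule(p_mm, p_mp, p_pm, p_pp):
    """Flip probabilities p(i, j), keyed by (memory i, input j)."""
    return {(-1, -1): F(p_mm), (-1, 1): F(p_mp), (1, -1): F(p_pm), (1, 1): F(p_pp)}

def covariances(rules):
    """Return [E(M_{n+1} X_k) for k = 1..n] from (16), where n = len(rules)."""
    n = len(rules)
    omega = F(1, 2)                                   # omega_1 = P{M_1 = 1}
    phi, psi = [], []
    for p in rules:
        bar = 1 - omega
        phi.append(bar * p[(-1, 1)] + omega * p[(1, -1)]
                   - bar * p[(-1, -1)] - omega * p[(1, 1)])   # definition (7)
        psi.append(sum(p.values()))                            # definition (4)
        # one-step recursion for omega_{k+1}
        omega = (omega + bar * (p[(-1, -1)] + p[(-1, 1)]) / 2
                 - omega * (p[(1, 1)] + p[(1, -1)]) / 2)
    result = []
    for k in range(n):                                # list index k is epoch k + 1
        prod = F(1)
        for j in range(k + 1, n):
            prod *= 1 - psi[j] / 2
        result.append(phi[k] * prod)
    return result

n = 5
follow_leader = [make_rule(0, 1, 1, 0) for _ in range(n)]
parity = [make_rule(0, 1, 1, 0)] + [make_rule(0, 1, 0, 1) for _ in range(n - 1)]
unbroken = [make_rule(0, 1, 1, 0)] + [make_rule(0, 0, 1, 0) for _ in range(n - 1)]
harmonic = [make_rule(0, F(1, k), F(1, k), 0) for k in range(1, n + 1)]

for name, rules in [("follow the leader", follow_leader), ("parity", parity),
                    ("unbroken runs", unbroken), ("harmonic", harmonic)]:
    print(name, [str(c) for c in covariances(rules)])
```

Only the harmonic rule spreads the covariance evenly over all past inputs; the deterministic rules either concentrate it on the last input or leave an exponentially small value.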
Theorem 1: For every positive integer $n$,

\[ \frac{1}{n} \le \max_{f_1, \ldots, f_n}\ \min_{1 \le k \le n} E(M_{n+1}X_k) < \frac{2}{n}. \tag{17} \]

Proof: The lower bound of $1/n$ follows immediately from the construction of the harmonic update rule in the last example. For $k \ge 1$ let us define

\[ \hat{\phi}_k = |\phi_k|, \tag{18} \]

\[ \hat{\psi}_k = \begin{cases} \dfrac{\psi_k}{2} & \text{if } 0 \le \psi_k \le 2, \\[1ex] 2 - \dfrac{\psi_k}{2} & \text{if } 2 \le \psi_k \le 4. \end{cases} \tag{19} \]

Note that $0 \le \psi_k \le 4$, so that the definition above achieves a sort of "normalization": $0 \le \hat{\psi}_k \le 1$. An immediate consequence of the definition is the equality

\[ \left|1 - \frac{\psi_k}{2}\right| = 1 - \hat{\psi}_k, \qquad k \ge 1. \]

We now claim that $\hat{\phi}_k \le 2\hat{\psi}_k$ for every positive integer $k$. Indeed, we have from (7) and (19) that

\[ \hat{\phi}_k = \bigl|\bar{\omega}_k\bigl(p_k(-1,1) - p_k(-1,-1)\bigr) + \omega_k\bigl(p_k(1,-1) - p_k(1,1)\bigr)\bigr| \]
\[ \le p_k(-1,1) + p_k(-1,-1) + p_k(1,-1) + p_k(1,1) = 2\hat{\psi}_k, \qquad \text{if } 0 \le \psi_k \le 2. \]

Also, setting $\bar{p}_k(i,j) = 1 - p_k(i,j)$, for $i \in \{-1, 1\}$ and $j \in \{-1, 1\}$, we have

\[ \hat{\phi}_k = \bigl|\bar{\omega}_k\bigl(\bar{p}_k(-1,-1) - \bar{p}_k(-1,1)\bigr) + \omega_k\bigl(\bar{p}_k(1,1) - \bar{p}_k(1,-1)\bigr)\bigr| \]
\[ \le \bar{p}_k(-1,-1) + \bar{p}_k(1,1) + \bar{p}_k(-1,1) + \bar{p}_k(1,-1) = 4 - \psi_k = 2\hat{\psi}_k, \qquad \text{if } 2 \le \psi_k \le 4. \]

This proves the claim.

Now consider (16). From the definitions (18) and (19), the "normalization" of $\hat{\psi}_j$, and the claim,

\[ E(M_{n+1}X_k) \le |\phi_k| \prod_{j=k+1}^{n}\left|1 - \frac{\psi_j}{2}\right| = \hat{\phi}_k \prod_{j=k+1}^{n}(1 - \hat{\psi}_j) \le 2\hat{\psi}_k \prod_{j=k+1}^{n}(1 - \hat{\psi}_j). \tag{20} \]

To establish the validity of the upper bound in (17) we begin by showing that

\[ \max_{\hat{\psi}_1, \ldots, \hat{\psi}_n}\ \min_{1 \le k \le n}\ \hat{\psi}_k \prod_{j=k+1}^{n}(1 - \hat{\psi}_j) \le \frac{1}{n}. \]

(Here the variables, $\hat{\psi}_j$, take values in the closed interval $[0, 1]$, as previously noted.) For notational simplicity, denote

\[ F_k = \hat{\psi}_k \prod_{j=k+1}^{n}(1 - \hat{\psi}_j). \tag{21} \]

Consider first the choice $\hat{\psi}_j = 1/j$ for each $j$. Direct substitution yields that $F_k = 1/n$ for each $k \le n$. Hence, $\min_{k \le n} F_k = 1/n$ for this choice of $\hat{\psi}_j$. We now claim that we can, without loss of generality, consider only choices $\hat{\psi}_j \ge 1/j$ for each value of $j$. To see this, assume $\hat{\psi}_j < 1/j$ for some choices of $j \le n$. Let $k$ be the largest such $j$. We then have $\prod_{j=k+1}^{n}(1 - \hat{\psi}_j) \le k/n$ as $\hat{\psi}_j \ge 1/j$ for $j > k$, and $\hat{\psi}_k < 1/k$. Hence, $\min_{k \le n} F_k < 1/n$ if there is any $j \le n$ for which $\hat{\psi}_j < 1/j$.

We will now show that, in fact, $\max \min F_k = 1/n$, with the maximum achieved, as just seen, for the choice, $\hat{\psi}_j = 1/j$, for each $j$. By the result just shown, without loss of generality, for each $j$ we need consider only choices for $\hat{\psi}_j$ in the closed interval $[1/j, 1]$. Now consider

\[ F_1 = \hat{\psi}_1 \prod_{j=2}^{n}(1 - \hat{\psi}_j). \]

For each $j$, we have $1/j \le \hat{\psi}_j \le 1$, and in particular, for $j = 1$ we have $\hat{\psi}_1 = 1$.³ Hence, we must necessarily obtain $F_1 < 1/n$, and consequently $\min_{k \le n} F_k < 1/n$, if there exists any $j$ with $\hat{\psi}_j > 1/j$.

³ In fact, the variable $\hat{\psi}_1$ appears only in the expression for $F_1$ where it appears as a product term. We can then maximize the value of $F_1$ without affecting any of the other $F_j$'s by setting $\hat{\psi}_1 = 1$.

We have, hence, shown that $\max \min F_k = 1/n$. From (20) and (21) we, then, have

\[ \max_{f_1, \ldots, f_n}\ \min_{1 \le k \le n} E(M_{n+1}X_k) \le 2 \max\ \min_{1 \le k \le n}\ \hat{\psi}_k \prod_{j=k+1}^{n}(1 - \hat{\psi}_j) = \frac{2}{n}. \]

To complete the proof, we need to show that the upper bound $2/n$ is strict. To see this, note that $\max \min F_k = 1/n$ is achieved only for the unique choice of $\hat{\psi}_j = 1/j$ for each $j \le n$. An examination of the bounding technique used in deriving the bound of equation (20) shows that a necessary condition for the upper bound in (17) to be realizable is that $\hat{\phi}_j = 2\hat{\psi}_j = 2/j$ for each $j$. But for $j = 1$ this is already impossible as can be verified from (5)-(7), and (18). Hence, $\max \min E(M_{n+1}X_k) < 2/n$. □
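The maximin step of the proof can be probed numerically. The following Python sketch is illustrative only (the names `F_values` and `best_random_min` are hypothetical, and the random search is our own device, not the authors' argument): it confirms that the choice $\hat{\psi}_j = 1/j$ gives $F_k = 1/n$ for every $k$, and then samples random points of $[0,1]^n$ to see that none of them attains a larger value of $\min_k F_k$.

```python
from fractions import Fraction
import random

def F_values(psi_hat):
    """Return [F_1, ..., F_n] with F_k = psi_hat_k * prod_{j>k} (1 - psi_hat_j)."""
    n = len(psi_hat)
    values = []
    for k in range(n):                      # list index k stands for epoch k + 1
        prod = 1
        for j in range(k + 1, n):
            prod *= 1 - psi_hat[j]
        values.append(psi_hat[k] * prod)
    return values

def best_random_min(n, trials=20000, seed=0):
    """Largest min_k F_k seen over random choices of psi_hat in [0, 1]^n."""
    rng = random.Random(seed)
    return max(min(F_values([rng.random() for _ in range(n)]))
               for _ in range(trials))

n = 8
harmonic = [Fraction(1, j) for j in range(1, n + 1)]
print(all(v == Fraction(1, n) for v in F_values(harmonic)))
print(best_random_min(n) <= 1 / n)
```

A random search of this kind cannot prove optimality; it only fails to find a counterexample, which is consistent with the uniqueness claim in the proof.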

Remarks: In this proof, we used the bound $\hat{\phi}_k \le 2\hat{\psi}_k$ valid for every $k$. This is, however, not the tightest possible as we saw above; in particular, the bound is not achievable when the best results (the bound of $2/n$) are obtained for the choice of parameters, $\hat{\psi}_k = 1/k$. A more careful analysis should see improvement in the upper bound. (In particular, the harmonic update rule is a persuasive candidate for being, in fact, the optimal update rule. If true, this would imply, of course, that $\max \min E(M_{n+1}X_k) = 1/n$.)

Note also that the proof yields the following stronger result: the same maximin bounds hold for the absolute value of the covariances, viz.,

\[ \frac{1}{n} \le \max_{f_1, \ldots, f_n}\ \min_{1 \le k \le n} |E(M_{n+1}X_k)| < \frac{2}{n}. \]

C.  Maximin  Mutual  Information 

Now consider the problem (2). Here the maximin problem is to maximize the mutual information between past inputs and the current memory state. In order to evaluate the mutual information, $I(M_{n+1}; X_k)$, for a general family of update rules we have recourse to Assertion 3. We obtain the lower bound below for $J_n$ by maximizing the minimum mutual information over a restricted set of update rules where the probabilities derived in Assertion 3 are somewhat more manageable.


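Theorem 2:

\[ \max_{f_1, \ldots, f_n}\ \min_{1 \le k \le n} I(M_{n+1}; X_k) = \Omega(n^{-2}) \qquad (n \to \infty). \]

More specifically,

\[ \max_{f_1, \ldots, f_n}\ \min_{1 \le k \le n} I(M_{n+1}; X_k) \ge \frac{1}{2n^2 \ln 2} + O\!\left(\frac{1}{n^4}\right) \qquad (n \to \infty). \]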


Proof: Let us restrict attention to the family of monotone symmetric update rules: $p_j(-1,-1) = p_j(1,1) = 0$, and $p_j(-1,1) = p_j(1,-1) = p_j$, $j \ge 1$. For simplicity let us denote

\[ z_k = p_k \prod_{j=k+1}^{n}(1 - p_j). \]

From (14) and (15) we then have

\[ P\{M_{n+1} = X_k = 1\} = P\{M_{n+1} = X_k = -1\} = \frac{1}{4}(1 + z_k), \]

and

\[ P\{M_{n+1} = 1, X_k = -1\} = P\{M_{n+1} = -1, X_k = 1\} = \frac{1}{4}(1 - z_k). \]

Noting that for the class of monotone symmetric update rules, the r.v.'s $M_{n+1}$ are symmetric, and take on the values $-1$ and $1$ with equal probability $1/2$, we have the following expression for the conditional uncertainty of $X_k$ given $M_{n+1}$:

\[ H(X_k \mid M_{n+1}) = -\frac{1}{2}(1 + z_k)\log_2 \frac{1}{2}(1 + z_k) - \frac{1}{2}(1 - z_k)\log_2 \frac{1}{2}(1 - z_k) = h\!\left(\frac{1 + z_k}{2}\right), \]

where $h$ is the binary entropy function

\[ h(y) = -y \log_2 y - (1 - y)\log_2(1 - y), \qquad 0 \le y \le 1. \]

(As usual, we define $0 \log 0 = 0$.) Hence,

\[ I(M_{n+1}; X_k) = H(X_k) - H(X_k \mid M_{n+1}) = 1 - h\!\left(\frac{1 + z_k}{2}\right). \]

By the same inductive argument used in establishing the upper bound for Theorem 1 we obtain that $\min_{k \le n} z_k$ is maximized among the class of monotone symmetric update rules for the unique choice of the harmonic update rule: $p_j = 1/j$ for each $j$. For this choice of update rule we have

\[ z_k = \frac{1}{k} \prod_{j=k+1}^{n}\left(1 - \frac{1}{j}\right) = \frac{1}{n}, \qquad k = 1, \ldots, n. \]

Using the monotone decreasing property of $h(y)$ for $1/2 \le y \le 1$ we have that $\min_{k \le n} I(M_{n+1}; X_k)$ is also maximized among the class of monotone symmetric update rules for the harmonic update rule. This estimate forms a useful lower bound for $J_n = \max \min I(M_{n+1}; X_k)$. Hence,

\[ \max_{f_1, \ldots, f_n}\ \min_{1 \le k \le n} I(M_{n+1}; X_k) \ge 1 - h\!\left(\frac{1}{2} + \frac{1}{2n}\right). \]

The Taylor series expansion for $\ln(1 + y)$, $|y| < 1$ yields the required asymptotic form in the statement of the theorem. □
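The asymptotic form in Theorem 2 can be compared with the exact harmonic-rule value $1 - h(1/2 + 1/(2n))$. The short Python sketch below is an illustration under our own naming (`binary_entropy`, `harmonic_mutual_information`, `leading_term` are hypothetical helpers); it uses floating-point arithmetic, so agreement is only up to rounding.

```python
import math

def binary_entropy(y):
    """h(y) = -y log2 y - (1 - y) log2 (1 - y), with 0 log 0 taken as 0."""
    if y <= 0.0 or y >= 1.0:
        return 0.0
    return -y * math.log2(y) - (1.0 - y) * math.log2(1.0 - y)

def harmonic_mutual_information(n):
    """I(M_{n+1}; X_k) for the harmonic rule, where z_k = 1/n for every k."""
    return 1.0 - binary_entropy(0.5 + 1.0 / (2 * n))

def leading_term(n):
    """Leading term 1 / (2 n^2 ln 2) from the statement of Theorem 2."""
    return 1.0 / (2 * n * n * math.log(2))

for n in (1, 10, 100, 1000):
    exact = harmonic_mutual_information(n)
    print(n, exact, leading_term(n), exact / leading_term(n))
```

For $n = 1$ the memory simply copies the single input, so the mutual information is one full bit; as $n$ grows the ratio of the exact value to the leading term approaches one.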


Remarks: A general examination of $J_n$ over all possible update rules using the results of Assertion 3 appears somewhat difficult in view of the lack of symmetry in the various probabilities. A reasonable candidate hypothesis may be that it suffices to consider only monotone symmetric rules, $p_k(-1,-1) = p_k(1,1) = 0$ and $p_k(-1,1) = p_k(1,-1) = p_k$ for each $k$. (If true this would, of course, yield the estimate $J_n \sim 1/(2n^2 \ln 2)$.) As noted earlier, this enforces symmetry and the intuitively appealing procedure of effecting no change in memory state if the current input agrees with the current state of memory. While it is relatively easy to show that we can, without loss of generality, set $p_n(-1,-1) = p_n(1,1) = 0$, the proof does not seem to extend simply to all $p_k(-1,-1)$ and $p_k(1,1)$.


D. Maximum Average Covariance

J. Komlós has recently communicated to us results of joint work with L. Rejto and G. Tusnady on the maximal expected payoff of a finite automaton with binary inputs [1]. Their results include the estimate $O(1/n)$ for the maximal average covariance, $K_n$, which they obtain using conditioning on inputs coupled with an inductive argument. We show this estimate here as an (almost) direct consequence of the proof of Theorem 1.

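Theorem 3: For every positive integer $n$,

\[ \frac{1}{n} \le \max_{f_1, \ldots, f_n} \frac{1}{n} \sum_{k=1}^{n} E(M_{n+1}X_k) < \frac{2}{n}. \]

Proof: The lower bound follows from the lower bound for $I_n$. Now consider (21). Writing $F_k = F_{k,n}$ explicitly as a function of $n$, we have

\[ F_{k,n} = \hat{\psi}_k \prod_{j=k+1}^{n}(1 - \hat{\psi}_j) \qquad (1 \le k \le n,\ n = 1, 2, \ldots). \]

Recall from equation (19) that $0 \le \hat{\psi}_j \le 1$ for every $j$, and that $\hat{\psi}_j$ depends solely on $j$ and not on $n$. Now form the sequence of sums, $\{S_n\}$, by setting

\[ S_n = \sum_{k=1}^{n} F_{k,n}. \]

Noting that

\[ F_{k,n} = \begin{cases} (1 - \hat{\psi}_n) F_{k,n-1} & \text{if } 1 \le k \le n - 1, \\ \hat{\psi}_n & \text{if } k = n, \end{cases} \]

we have

\[ S_n = (1 - \hat{\psi}_n) S_{n-1} + \hat{\psi}_n. \]

As $0 \le S_1 = \hat{\psi}_1 \le 1$, an easy inductive argument shows that $S_n$ is an iteration of convex combinations of numbers less than one, so that $S_n \le 1$. From (20) and the concluding remarks of the proof of Theorem 1, we have

\[ E(M_{n+1}X_k) \le 2F_{k,n}, \]

so that

\[ \max_{f_1, \ldots, f_n} \frac{1}{n} \sum_{k=1}^{n} E(M_{n+1}X_k) < 2 \max \frac{S_n}{n} \le \frac{2}{n}. \]

This completes the proof. □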
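The convex-combination step can be checked mechanically. The Python sketch below is illustrative, with hypothetical names (`F_kn`, `S_direct`, `S_recursive`) and an arbitrary random choice of $\hat{\psi}_j$; it compares the direct sum $\sum_k F_{k,n}$ with the recursion $S_n = (1 - \hat{\psi}_n)S_{n-1} + \hat{\psi}_n$ in exact arithmetic and confirms $0 \le S_n \le 1$.

```python
from fractions import Fraction
import random

def F_kn(psi_hat, k, n):
    """F_{k,n} = psi_hat_k * prod_{j=k+1}^{n} (1 - psi_hat_j); k, n are 1-based."""
    value = psi_hat[k - 1]
    for j in range(k + 1, n + 1):
        value *= 1 - psi_hat[j - 1]
    return value

def S_direct(psi_hat, n):
    """S_n as the plain sum of F_{k,n} over k = 1..n."""
    return sum(F_kn(psi_hat, k, n) for k in range(1, n + 1))

def S_recursive(psi_hat, n):
    """S_n from the recursion S_m = (1 - psi_hat_m) S_{m-1} + psi_hat_m, S_0 = 0."""
    s = Fraction(0)
    for m in range(1, n + 1):
        s = (1 - psi_hat[m - 1]) * s + psi_hat[m - 1]
    return s

rng = random.Random(1)
psi_hat = [Fraction(rng.randint(0, 100), 100) for _ in range(12)]
for n in range(1, 13):
    assert S_direct(psi_hat, n) == S_recursive(psi_hat, n)
    assert 0 <= S_recursive(psi_hat, n) <= 1
print("recursion and direct sum agree for n = 1..12")
```

Rewriting the recursion as $1 - S_n = (1 - \hat{\psi}_n)(1 - S_{n-1})$ gives $S_n = 1 - \prod_{j=1}^{n}(1 - \hat{\psi}_j)$, which makes the bound $S_n \le 1$ immediate.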

Remarks: In fact, this convex combination argument can be used in lieu of the argument presented in the proof of Theorem 1. Note also that the bound of (20) is easily improved to $|E(M_{n+1}X_k)| \le 2F_{k,n}$. The proof of Theorem 3 then yields the stronger result

\[ \frac{1}{n} \le \max_{f_1, \ldots, f_n} \frac{1}{n} \sum_{k=1}^{n} |E(M_{n+1}X_k)| < \frac{2}{n} \qquad (n \ge 1). \]

Substantial improvements in the maximum average covariance may be obtained if memory updates are allowed access to all past inputs (and not just the last input). Let $\mathcal{F}^n$ denote the family of all (probabilistic) functions mapping $\{-1, 1\}^n$ into $\{-1, 1\}$.

Theorem 4: For every positive integer $n$,

\[ \max_{f_1, \ldots, f_n} \frac{1}{n} \sum_{k=1}^{n} E(M_{n+1}X_k) \le \max_{f \in \mathcal{F}^n} \frac{1}{n} \sum_{k=1}^{n} E\bigl(X_k f(X_1, \ldots, X_n)\bigr) \sim \sqrt{\frac{2}{\pi n}} \qquad (n \to \infty). \]

Proof: The first inequality is immediate. Now, for any $f \in \mathcal{F}^n$, we have

\[ \sum_{k=1}^{n} E\bigl(X_k f(X_1, \ldots, X_n)\bigr) = E\Bigl(f(X_1, \ldots, X_n) \sum_{k=1}^{n} X_k\Bigr) \le E\Bigl|\sum_{k=1}^{n} X_k\Bigr| \]

(as $f(X_1, \ldots, X_n) \in \{-1, 1\}$), with equality if $f$ is chosen to be the majority function: for any choice of Boolean variables $x_1, \ldots, x_n \in \{-1, 1\}$ let $N^+$ denote the number of variables, $x_j$, that take the value $+1$, and let $N^- = n - N^+$ denote the number of variables, $x_j$, that take the value $-1$; we define the majority function, $f^M(x_1, \ldots, x_n)$, by

\[ f^M(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } N^+ \ge N^-, \\ -1 & \text{if } N^+ < N^-. \end{cases} \]

Let us denote by $S_n$ the random walk

\[ S_n = \sum_{k=1}^{n} X_k. \]

We then have

\[ \max_{f \in \mathcal{F}^n} \frac{1}{n} \sum_{k=1}^{n} E\bigl(X_k f(X_1, \ldots, X_n)\bigr) = \frac{1}{n} E(|S_n|) = \frac{1}{n} \sum_{j=0}^{n} |n - 2j| \binom{n}{j} 2^{-n} = \frac{2}{n} \left\lceil \frac{n}{2} \right\rceil \binom{n}{\lceil n/2 \rceil} 2^{-n}, \]

with the last equality following by the application of standard binomial identities. An application of Stirling's formula now yields the required result. □

The average covariance cannot, hence, exceed the order of $1/\sqrt{n}$ even if we allow (binary) update rules with unlimited access to past history.
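The random-walk quantity in the proof of Theorem 4 is easy to evaluate exactly. The following Python sketch is an illustration with hypothetical function names; it computes $E|S_n|$ from the binomial distribution, compares it with a closed form involving the central binomial coefficient, and compares $E|S_n|/n$ with the Stirling estimate $\sqrt{2/(\pi n)}$. It assumes Python 3.8 or later for `math.comb`.

```python
import math
from fractions import Fraction

def mean_abs_walk(n):
    """E|S_n| for a symmetric +-1 random walk, summed over the binomial law."""
    total = sum(abs(n - 2 * j) * math.comb(n, j) for j in range(n + 1))
    return Fraction(total, 2 ** n)

def mean_abs_walk_closed_form(n):
    """Closed form 2 * ceil(n/2) * C(n, ceil(n/2)) / 2^n."""
    m = (n + 1) // 2                      # ceil(n / 2) for positive integers
    return Fraction(2 * m * math.comb(n, m), 2 ** n)

def stirling_estimate(n):
    """Asymptotic value sqrt(2 / (pi n)) of E|S_n| / n."""
    return math.sqrt(2.0 / (math.pi * n))

for n in (1, 2, 3, 4, 25, 400):
    exact = mean_abs_walk(n)
    assert exact == mean_abs_walk_closed_form(n)
    print(n, float(exact) / n, stirling_estimate(n))
```

The exact value sits slightly below the Stirling estimate for large even $n$, and the gap closes as $n$ increases.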

III. Related Issues

Thus far, we have been mainly concerned with update rules with two Boolean arguments and producing one Boolean variable. The state of memory at epoch $n + 1$ is, hence, a Boolean function of $n$ Boolean variables (the inputs, $X_1, \ldots, X_n$) taken in sequence and passed through a cascade of Boolean functions of two Boolean variables. A natural question that arises is how many deterministic Boolean functions of $n$ variables can be constructed in this fashion out of the total of $2^{2^n}$ Boolean functions of $n$ variables?

Let $g_k: \{-1, 1\}^2 \to \{-1, 1\}$, $k \ge 2$ denote a sequence of (deterministic) Boolean functions of two Boolean variables. We recursively form a sequence of Boolean functions of $k$ Boolean variables, $f_k: \{-1, 1\}^k \to \{-1, 1\}$, for $k \ge 2$, as follows:

\[ f_2(X_1, X_2) = g_2(X_1, X_2), \]

\[ f_k(X_1, \ldots, X_{k-1}, X_k) = g_k\bigl(f_{k-1}(X_1, \ldots, X_{k-1}), X_k\bigr) \qquad (k \ge 3). \]

Let $\mathcal{F}_k$ denote the family of all (deterministic) Boolean functions of $k$ Boolean variables, $f_k$, constructed recursively, for every choice of functions $g_k$.

Theorem 5:

\[ |\mathcal{F}_n| = \frac{2}{5}\, 6^n + \frac{8}{5}, \qquad n \ge 2. \]

Remark: In fact, it is easy to see that $2^n \le |\mathcal{F}_n| \le 16^n$. Clearly, this count falls far short of the $2^{2^n}$ possible Boolean functions of $n$ Boolean variables.
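For small $n$ the count in Theorem 5 can be confirmed by brute force. The Python sketch below is an illustration only (the function name `cascade_family_sizes` and the truth-table encoding are our own assumptions): it represents each Boolean function by its truth table, builds $\mathcal{F}_n$ from $\mathcal{F}_{n-1}$ by composing with all 16 two-input gates, and counts the distinct tables.

```python
from itertools import product

def cascade_family_sizes(n_max):
    """Return {n: |F_n|} for n = 2..n_max by explicit enumeration."""
    # A gate g is a truth table of length 4 indexed by the pair (a, b).
    gates = list(product((-1, 1), repeat=4))

    def apply_gate(g, a, b):
        # a in {-1, 1} maps to offset 0 or 2; b in {-1, 1} maps to offset 0 or 1.
        return g[(a + 1) + (b + 1) // 2]

    # F_2: every gate applied directly to (X_1, X_2), inputs in product order.
    inputs2 = list(product((-1, 1), repeat=2))
    family = {tuple(apply_gate(g, x1, x2) for (x1, x2) in inputs2) for g in gates}
    sizes = {2: len(family)}

    for n in range(3, n_max + 1):
        next_family = set()
        for f in family:                 # f is a table over n - 1 variables
            for g in gates:
                # product order: the new variable X_n varies fastest.
                table = tuple(apply_gate(g, v, b) for v in f for b in (-1, 1))
                next_family.add(table)
        family = next_family
        sizes[n] = len(family)
    return sizes

sizes = cascade_family_sizes(5)
for n, size in sizes.items():
    print(n, size, (2 * 6 ** n + 8) // 5)
```

The enumeration grows roughly like $6^n$ tables of length $2^n$, so it is practical only for small $n$; it is meant as a sanity check on the formula, not as a counting method.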

Proof: The demonstration is inductive in nature. For $n = 2$ we clearly have

\[ |\mathcal{F}_2| = 16, \]

as there are $2^4$ Boolean functions of two Boolean variables. Now, for $n \ge 3$ we claim the following recursion holds:

\[ |\mathcal{F}_n| = 12\left(\frac{|\mathcal{F}_{n-1}|}{2} - 1\right) + 4 = 6|\mathcal{F}_{n-1}| - 8. \]

To establish this it is helpful to consider the table of all 16 Boolean functions of two Boolean variables, $X$ and $Y$, illustrated in Fig. 5. Note that two of the possible functions (the first row) are the constant functions, which depend on neither $X$ nor $Y$, and that two more functions (the second row) depend only on $X$ and not on $Y$. All the remaining 12 functions depend explicitly on $Y$. Let us call a set of Boolean functions independent if no function in the set is the complement of another function in the set. Now, by symmetry, the complement of every function in $\mathcal{F}_{n-1}$ is also in $\mathcal{F}_{n-1}$. Hence, we can find a maximal set of $|\mathcal{F}_{n-1}|/2$ independent functions in $\mathcal{F}_{n-1}$. Clearly, one of these functions is the constant function so that there are $|\mathcal{F}_{n-1}|/2 - 1$ functions in a maximal set of independent functions in $\mathcal{F}_{n-1}$ which depend explicitly on one or more of the variables $X_1, \ldots, X_{n-1}$.

Now consider functions, $g_n(f_{n-1}(X_1, \ldots, X_{n-1}), X_n)$. Let us identify with $X_n$ the variable $X$ and with $f_{n-1}(X_1, \ldots, X_{n-1})$ the variable $Y$ in the table of Boolean functions of two Boolean variables. Each of the independent, nonconstant functions, $Y$, in $\mathcal{F}_{n-1}$ yields 12 distinct functions depending explicitly on $Y$ in $\mathcal{F}_n$ as can be verified from Fig. 5. (By symmetry, the complement, $\bar{Y}$, of each independent, nonconstant function $Y$ in $\mathcal{F}_{n-1}$ yields the same set of 12 distinct functions as does $Y$.) There are, hence, $12(|\mathcal{F}_{n-1}|/2 - 1)$ distinct functions in $\mathcal{F}_n$ that depend explicitly on one or more of the variables $X_1, \ldots, X_{n-1}$. Adding in the four functions (the two constant functions, and the functions returning the values $X_n$ and $\bar{X}_n$) which are independent of the variables $X_1, \ldots, X_{n-1}$ completes the count. □

Fig. 5. A tabulation of the 16 possible Boolean functions of two Boolean variables, $X \in \{-1, 1\}$ and $Y \in \{-1, 1\}$. The first column enumerates a set of eight distinct Boolean functions of these two variables, none of which is a complement of another function in the column. The second column lists the complements of the functions listed in the first column (each row gives a function and its complement). We use an overbar to denote complement (logical NOT), $\wedge$ to denote conjunction (logical AND), and $\vee$ to denote disjunction (logical OR).

A natural extension to the maximin problem is to consider how much information can be stored about the past if now (say) $m \ge 1$ bits of memory are available. This issue is still open. The simple strategy of interleaving the input sequence across the memory bits (equivalently, partitioning the input sequence into $m$ equal length subsequences and apportioning one bit of memory to each subsequence), for instance, effectively reduces the problem to a one bit memory problem with an equivalent "reduced sequence length" of $n/m$. With the mutual information measure, for instance, if $m$ bits are available for the memory, we have

\[ \sup \min_{1 \le k \le n} I(M_{n+1}; X_k) = \Omega\!\left(\frac{m^2}{n^2}\right). \]

Another approach giving the same results is to update each bit of memory independently. Substantial improvements over these straightforward gains may, however, be possible if more complex update strategies are used.

The tightening of the information bounds shown in the previous section is open. Specifically, it appears plausible that we need to consider only monotone symmetric update rules. As noted earlier, if this conjecture holds true, then $I_n = 1/n$ and $J_n \sim 1/(2n^2 \ln 2)$ with equality holding in both cases for a choice of the harmonic update rule.

Another extension of the problem is to consider input sequences drawn from nonsymmetric Bernoulli trials, and in general, i.i.d. inputs $X_k$, $k \ge 1$ drawn from a distribution on the real line (with a suitable second moment constraint). The maximin problem with one or more bits of available memory is open for this case.

The maximin problem analyzed here has implications to questions on the information storage capacity of neural networks. A formal McCulloch-Pitts neuron is characterized by $n$ real weights, $w_1, \ldots, w_n$; it accepts $n$ binary inputs, $u_1, \ldots, u_n \in \{-1, 1\}$ and produces a binary output $v \in \{-1, 1\}$ according to the threshold rule

\[ v = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} w_j u_j \ge 0, \\ -1 & \text{if } \sum_{j=1}^{n} w_j u_j < 0. \end{cases} \]

In a network of formal neurons information can be regarded as being stored in the weights. If the weights are allowed to range over only a finite set of values, a cogent question is: How much information is stored per bit of weight?

As a specific instance, consider a classification problem on vertices of the $n$-cube. Let $u^1, \ldots, u^m \in \{-1, 1\}^n$ be $m$ randomly chosen patterns (with components drawn from symmetric Bernoulli trials). Let $\mathcal{A}(n, m)$ denote the attribute (of the $m$-set of patterns) that there is a choice of weight vector, $w$, such that $\langle w, u^\alpha \rangle > 0$, $\alpha = 1, \ldots, m$. (Alternatively, $\mathcal{A}(n, m)$ is the attribute that a formal neuron classifies each of the patterns properly.) We say that $C_n$ is a capacity function for the attribute $\mathcal{A}(n, m)$ if, for every $\lambda > 0$, as $n \to \infty$:

a) $P\{\mathcal{A}(n, m)\} \to 1$, if $m \le (1 - \lambda)C_n$;

b) $P\{\mathcal{A}(n, m)\} \to 0$, if $m \ge (1 + \lambda)C_n$.

The capacity function specifies, in a sense, the largest size of random problem that can be reliably done by a linear threshold element or formal neuron. Equivalently, it can be thought of as specifying the maximum amount of information that can be reliably stored in the weights. This interpretation is particularly persuasive when the neural weights are constrained to be binary. In this case, each weight, $w_j \in \{-1, 1\}$, has to store information about the $j$th component of each pattern,

\[ u_j^1, \ldots, u_j^m \in \{-1, 1\}, \]

so that the information stored per bit of weight is directly related to the capacity. In this form the problem can be seen to be strongly related to the maximin problem we have analyzed here. A rigorous analysis shows that the capacity of a neuron with binary weights is, in fact, linear in $n$. In a succeeding paper, we illustrate how the ideas developed in this paper can be used in the training of formal neurons with binary weights, and provide rigorous capacity calculations [4].

Abstract

Learning real weights for a McCulloch-Pitts neuron is equivalent to linear programming and can hence be done in polynomial time. Efficient local learning algorithms such as Perceptron Learning, further, guarantee convergence in finite time. The problem becomes considerably harder, however, when it is sought to learn binary weights; this is equivalent to integer programming which is known to be NP-complete. A family of probabilistic algorithms which learn binary weights for a McCulloch-Pitts neuron with inputs constrained to be binary is proposed here, the target functions being majority functions of a set of literals. These algorithms have low computational demands and are essentially local in character. Rapid (average-case) quadratic rates of convergence for the algorithm are predicted analytically and confirmed through computer simulations when the number of examples is within capacity. It is also shown that for the functions under consideration, Perceptron Learning converges rapidly (but to an, in general, non-binary solution weight vector).


1  INTRODUCTION 

We consider learning in the context of linearly separable functions. Given an arbitrary linearly separable dichotomy of a finite set of patterns, the Perceptron Training Algorithm [1] guarantees convergence in finite time of an iteratively updated sequence of weight vectors to a real solution vector which separates the dichotomy. The problem becomes considerably harder, however, when we are required to learn binary weights for a linearly separable problem. The problem of learning real weights for a McCulloch-Pitts neuron is equivalent to linear programming, for which there exist polynomial time algorithms. (The Perceptron Learning Rule is an on-line procedure which, as we will see in the sequel, can converge extremely rapidly under moderate conditions. Similar conclusions are also reported by Baum [2] under slightly different hypotheses.) Learning binary weights for a McCulloch-Pitts neuron is, however, equivalent to integer programming, which is known to be NP-complete [3].

Notwithstanding  the  apparent  difficulty  of  the  learning  problem  in  this  case,  the  potentially 
lower  cost  and  simplicity  of  neural  networks  comprised  of  binary  interconnections  as  opposed  to 
real  weights  makes  such  circuits  rather  appealing  practically.  Recent  theoretical  results  also  bolster 
the  usage  of  such  circuits:  the  computational  capacity  of  a  neuron  with  binary  weights  remains 
comparable  to  that  of  a  neuron  with  real  weights  [4,  5].  The  development  of  efficient  heuristics 
for  learning  binary  weights  (paralleling  the  development  of  such  algorithms  as  backpropagation  for 
neural  circuits  with  real  interconnections)  is,  hence,  critical  if  the  cost  advantages  promised  by 
binary  circuits  are  to  be  realised. 

We  present  here  a  new  family  of  probabilistic  algorithms  which  learn  binary  weights  for  a 
neuron in an on-line setting. The target functions here are weighted linear threshold functions with
weights from $\{-1, 1\}$ which are defined on a domain of binary $n$-tuples, $\{-1, 1\}^n$. In particular, the
target  functions  are  majority  functions  of  a  set  of  literals,  that  is,  majority  functions  that  may  have 
any  of  their  inputs  complemented.  Given  a  partial  Boolean  function  defined  on  a  subset  of  m  points 
(patterns) from this class, or alternatively, given a dichotomy of $m$ points in $\{-1, 1\}^n$ which can be
linearly separated with weights from $\{-1, 1\}$, the randomised algorithm described here iteratively
adjusts  the  weights  until  a  solution  (binary)  weight  vector  which  separates  the  dichotomy  is  found. 
The  principal  advantage  the  proposed  algorithm  has  over  Perceptron  Learning  is  that,  not  only 
is  the  solution  vector  generated  by  the  procedure  binary,  but  the  weights  remain  confined  to  the 
domain $\{-1, 1\}$ throughout the entire learning process. The algorithm, as we will see, converges
rapidly  to  a  solution  when  the  number  of  patterns  to  be  dichotomised  is  within  the  computational 
capacity  of  the  neuron.  An  interesting  feature  of  the  algorithm  is  that  it  is  local,  which  makes  it 
appealing  from  an  implementation  perspective. 

In  the  next  section  we  set  up  the  learning  protocol  and  describe  the  algorithm.  We  derive  some 
preliminary  results  on  the  expected  time  of  first  passage  of  random  walks  to  given  boundaries  in 
Section 3. In Section 4 we analyse the algorithm and show quadratic initial rates of convergence
when the number of training examples is within the computational capacity of the threshold element.
The analysis here is for the average case.¹ We also compare the results obtained with the
rate of convergence of the Perceptron Training Algorithm; we show that the Perceptron Algorithm
converges in the worst case with a mistake bound $O(n^2)$ to a real solution vector under the constraint
that there exists a binary solution vector within the solution space. We also present an
average  case  analysis  of  a  modification  of  the  Perceptron  Learning  Algorithm,  in  the  spirit  of  the 
proposed  Directed  Drift  Algorithm,  wherein  a  single  weight  component  is  updated  at  a  time.  In 
Section  5  we  present  simulations  and  discussions  of  the  algorithm. 

On notation: We will use the symbol $\mathbb{B}$ to denote the set $\{-1, 1\}$. If $x = (x_1, \ldots, x_n)$
and $y = (y_1, \ldots, y_n)$ are points in real Euclidean $n$-space, we denote by $\langle x, y \rangle$ the inner product
$\sum_{j=1}^n x_j y_j$. Following J. Riordan we use the word epoch to denote points on the time axis. A
physical weight update may take some time, but we will assume updates are timeless and occur
at epochs.² We define the function $\mathrm{sgn} : \mathbb{R} \to \mathbb{B}$ by $\mathrm{sgn}\, x = x/|x|$ if $x \neq 0$ and $\mathrm{sgn}\, 0 = 0$. All
logarithms in the exposition are to base $e$. Finally, if $\{x_n\}$ and $\{y_n\}$ are positive sequences, we

¹Directed Drift is a randomised algorithm, and arbitrarily bad worst-case results are possible. The probability of
such occurrences is small, however, and is governed by the extreme tails of the underlying probability distribution.

²In his text, W. Feller credits J. Riordan with initiating the usage of the word epoch in such situations [7, page
73].




denote: $x_n = O(y_n)$ if there is a positive constant $K$ such that $x_n/y_n < K$ for all $n$; $x_n \sim y_n$ if
$x_n/y_n \to 1$ as $n \to \infty$.

2  LEARNING 

2.1  The  Setting 

We are given a set of patterns, $U \subset \mathbb{R}^n$, and a function $f : U \to \mathbb{B}$ which is linearly separable:
specifically, there exists a solution weight vector, $w^* \in \mathbb{R}^n$, such that

$$\mathrm{sgn}\{\langle w^*, u \rangle\} = f(u) \qquad (1)$$

for every choice of pattern $u \in U$. We call the function $f$ the target function (also known as the
target concept in the literature on Learning Theory). The target functions are, hence, majority
functions of a set of literals. Clearly, $f$ realises a dichotomy of $U$. Without loss of generality we
assume that $f(u) = 1$ for every pattern $u \in U$.³

An  algorithm  for  learning  from  examples  is  a  procedure  where  learning  takes  place  in  a  sequence 
of  trials.  The  protocol  is  as  follows: 

1° At epoch $t$ the system is characterised by a weight vector, $w[t]$, and receives an example pattern,
$u[t]$, drawn from $U$.

2° The system produces a response, $-1$ or $1$, according to the sign of $\langle w[t], u[t] \rangle$.

3° A new weight vector, $w[t+1]$, is generated based on the current response, weight vector, $w[t]$,
and example, $u[t]$.

The  procedure  is  carried  out  iteratively,  and  is  terminated  if  a  solution  weight  vector  is  obtained. 
We call the sequence of examples, $\{u[t]\}$, the training sequence, and the sequence of weight
vectors, $\{w[t]\}$, the learning sequence. If the procedure terminates in a finite time, we say that
the learning algorithm has learnt the function $f$. We will be interested in the mistake bound, that is, the
number  of  classification  mistakes  the  learning  algorithm  makes  on  the  set  of  examples  before  it 
learns  the  given  function.  For  our  purposes,  the  mistake  bound  is  equal  to  the  number  of  updates 
of  the  weight  vector  before  the  function  is  learnt.  We  denote  the  mistake  bound  by  T. 

In the sequel, we will further restrict the set of patterns, $U$, to be drawn from the vertices of
the $n$-cube, $\mathbb{B}^n$, and require that there is a binary solution weight vector, $w^* \in \mathbb{B}^n$, for $f$. A fourth
restriction that we will require of any binary learning algorithm is:

4° The initial choice of weight vector, $w[1]$, is arbitrary, subject only to its being chosen from $\mathbb{B}^n$,
and the learning algorithm generates binary weight vectors, $w[t] \in \mathbb{B}^n$, at each epoch of the
learning process.

We  will,  thus,  be  constrained  to  looking  at  algorithms  which  make  only  bit  changes  in  the  weight 
vector  at  each  epoch.  Specifically,  the  weights  are  confined  to  the  domain  {-1,1}  throughout 
the  learning  process.  This  situation  may  be  compared  to  Perceptron  Learning,  where  the  weights 
typically  grow  in  magnitude  during  the  learning  process. 

³If $f(u) = -1$ then $f(-u) = 1$, as can be easily seen from (1). Replacing each pattern in $U$ for which $f(u) = -1$
by $-u$ we obtain a corresponding set of patterns $U'$; if $w^*$ is any solution weight vector separating the dichotomy of
$U$ specified by $f$ then all patterns in $U'$ lie on the same (positive) side of the hyperplane corresponding to $w^*$, and
conversely.
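
The reflection described in the footnote above can be sketched in a few lines. This is a minimal illustration and not code from the paper; the function names `reflect_patterns` and `separates`, and the use of NumPy, are assumptions made here.

```python
import numpy as np

def reflect_patterns(patterns, labels):
    """Replace every pattern u whose label f(u) is -1 by -u.

    After reflection every pattern has target +1, so a solution
    vector w only needs <w, u> > 0 for every row u.
    """
    patterns = np.asarray(patterns, dtype=int)
    labels = np.asarray(labels, dtype=int)
    return patterns * labels[:, None]

def separates(w, patterns):
    """True if <w, u> > 0 for every (reflected) pattern u."""
    return bool(np.all(np.asarray(patterns) @ np.asarray(w) > 0))
```

With the reflected set, checking a candidate weight vector reduces to a single sign test per pattern.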
2.2 Directed Drift Algorithms

We  present  here  a  family  of  probabilistic  algorithms  for  binary  learning.  We  call  these  algorithms 
Directed  Drift  Algorithms  because,  as  we  shall  see,  they  share  some  similarities  with  asymmetric 
random  walks  with  a  preferred  direction  toward  a  solution. 

Let $U$ be any subset of patterns from $\mathbb{B}^n$, and let $\{u[t]\}$ be any training sequence such that
each of the patterns in $U$ appears infinitely often.⁴ Let $\{w[t]\}$ denote a binary learning sequence.
For each epoch, $t$, we denote by $J[t]$ the subset of indices for which the corresponding components
of $w[t]$ and $u[t]$ are opposite in sign:

$$J[t] = \{\, j : w_j[t] \neq u_j[t] \,\}.$$

Single  bit  updates  We  begin  with  the  simplest  version  of  the  algorithm  where  no  more  than  a 
single  component  of  the  weight  vector  is  updated  per  epoch. 


base: $w[1] \in \mathbb{B}^n$ is chosen arbitrarily.

iteration:  The  algorithm's  response  is  predicated  upon  whether  a  correct  or  incorrect 
response  is  obtained  at  the  current  epoch,  t. 

• If $\langle w[t], u[t] \rangle > 0$, then the weight vector is left unchanged: $w[t+1] = w[t]$.

• If $\langle w[t], u[t] \rangle \le 0$, then an index $j[t]$ is picked at random from the set of indices,
$J[t]$, of mismatched components. The new weight vector is now formed according
to the following rule:

$$w_j[t+1] = \begin{cases} w_j[t] & \text{if } j \neq j[t] \\ -w_j[t] & \text{if } j = j[t]. \end{cases} \qquad (2)$$


The intuition behind the algorithm is as follows. If a binary solution vector, $w^* \in \mathbb{B}^n$, exists,
then necessarily we must have $\langle w^*, u \rangle = \sum_{j=1}^n w^*_j u_j > 0$ for each pattern $u \in U$. As there is a
contribution of $+1$ to the sum if two corresponding components of $w^*$ and $u$ have the same sign,
and $-1$ if the signs are mismatched, it follows that the binary solution vector has more component
sign matches than mismatches with each pattern in $U$.

Now  the  algorithm  updates  the  current  estimate  of  the  weight  vector  if  and  only  if  the  current 
pattern from the training sequence is misclassified. A weight vector update results in a randomly
chosen  mismatched  component  of  the  weight  vector  being  flipped  to  the  sign  of  the  corresponding 
pattern  component.  Since  there  is  a  probability  better  than  a  half  that  a  randomly  specified 
component  of  any  pattern  has  the  same  sign  as  the  corresponding  component  of  the  binary  solution 
vector,  it  follows  that  at  each  epoch  the  a  priori  probability  that  the  weight  vector  update  is  in 
the  direction  of  the  binary  solution  vector  is  better  than  a  half.  We  will  explore  this  more  formally 
in  the  sequel. 
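
The single bit update rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the patterns are taken to be already reflected so that every target is $+1$, the training sequence cycles through the patterns, a pattern with zero inner product is treated as misclassified, and the names `directed_drift` and `max_updates` are hypothetical.

```python
import numpy as np

def directed_drift(patterns, w_init, rng, max_updates=100_000):
    """Single bit update Directed Drift on reflected +/-1 patterns.

    Returns (w, updates). w is None if max_updates is reached before
    a full cycle through the patterns is classified correctly.
    """
    U = np.asarray(patterns, dtype=int)
    w = np.array(w_init, dtype=int)
    m = len(U)
    updates = 0   # number of weight updates (mistakes) so far
    streak = 0    # consecutive patterns classified correctly
    t = 0
    while streak < m:
        if updates >= max_updates:
            return None, updates
        u = U[t % m]          # cyclic training sequence (assumption)
        t += 1
        if u @ w > 0:
            streak += 1       # correct response: weights unchanged
            continue
        # Misclassified: flip one randomly chosen mismatched component.
        mismatched = np.flatnonzero(w != u)
        j = rng.choice(mismatched)
        w[j] = -w[j]
        updates += 1
        streak = 0
    return w, updates
```

The weights stay in $\{-1, 1\}$ throughout, since every update only negates one component.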


Several  bit  updates  The  algorithm  can  be  simply  extended  to  accommodate  more  than  a  single 
bit update per epoch. Let $\{N_t\}$ be a sequence of integers with $0 < N_t < n/2$.

⁴Note that $U \subset \mathbb{B}^n$ is a finite set of patterns. If $U = \{u^1, \ldots, u^m\}$ is an $m$-set of patterns, then we can, for
instance, obtain valid training sequences by cycling through the patterns or choosing a pattern randomly at each
epoch.
base: $w[1] \in \mathbb{B}^n$ is chosen arbitrarily.

iteration: As before, updates are made only if the current pattern from the training
sequence is misclassified.


• If $\langle w[t], u[t] \rangle > 0$, then the weight vector is left unchanged: $w[t+1] = w[t]$.

• If $\langle w[t], u[t] \rangle \le 0$, then $N_t$ indices $j_1[t], \ldots, j_{N_t}[t]$ are picked at random from the
set of indices, $J[t]$, of mismatched components. The new weight vector is now
formed according to the following rule:

$$w_j[t+1] = \begin{cases} w_j[t] & \text{if } j \notin \{j_1[t], \ldots, j_{N_t}[t]\} \\ -w_j[t] & \text{if } j \in \{j_1[t], \ldots, j_{N_t}[t]\}. \end{cases}$$


The sequence $\{N_t\}$ specifies the number of bits to be changed at each update epoch, and the
proper  choice  of  this  sequence  is  clearly  critical  to  the  functioning  of  the  algorithm.  This  is  analogous 
to  choosing  an  appropriate  cooling  schedule  for  simulated  annealing  [6]. 
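
A single several-bit update step might look like the sketch below. It is illustrative only: the function name `multi_bit_update` is hypothetical, the indices are drawn without replacement (an assumption; the text only says "picked at random"), and the guard capping the number of flips at the number of mismatches is an added safety check, not part of the rule as stated.

```python
import numpy as np

def multi_bit_update(w, u, n_flip, rng):
    """One several-bit Directed Drift step; returns the new weight vector."""
    w = np.array(w, dtype=int)
    u = np.asarray(u, dtype=int)
    if u @ w > 0:
        return w  # pattern classified correctly: no change
    mismatched = np.flatnonzero(w != u)
    # Guard (assumption): never ask for more indices than are mismatched.
    k = min(n_flip, mismatched.size)
    chosen = rng.choice(mismatched, size=k, replace=False)
    w[chosen] = -w[chosen]  # flip the chosen bits toward the pattern
    return w
```

How the flip count should vary from epoch to epoch is left open here, as it is in the text.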


2.3  Perceptron  Training  Algorithm 

A  geometrical  appreciation  of  the  Directed  Drift  Algorithm  can  be  obtained  from  a  consideration 
of  the  classical  Perceptron  Training  Algorithm.  Let  {u[t]}  be  a  training  sequence  of  patterns,  and 
let $\{w[t]\}$ denote a learning sequence of real weight vectors.

Fixed increment Perceptron Training This is the simplest form of Perceptron learning. Let
$\beta > 0$ be fixed.

base: The initial choice of weight vector is arbitrary. For simplicity we take $w[1] = 0$.

iteration: As before, weight vector updates are made only if a pattern is misclassified.

• If $\langle w[t], u[t] \rangle > 0$, then the weight vector is left unchanged: $w[t+1] = w[t]$.

• If $\langle w[t], u[t] \rangle \le 0$, then set $w[t+1] = w[t] + \beta u[t]$.

The  Perceptron  Training  Algorithm  is  known  to  converge  to  a  real  solution  vector  (if  it  exists) 
in  finite  time  [1].  Geometrically  speaking,  the  situation  is  as  depicted  in  Figure  1(a).  When  a 
pattern  from  the  training  sequence  is  misclassified  by  the  current  estimate  of  the  weight  vector,  the 
weight  vector  update  is  in  the  direction  of  the  misclassified  pattern.  This  idea  of  updating  in  the 
direction  of  the  misclassified  pattern  is  extended  in  the  Directed  Drift  Algorithms.  The  situation 
is  as  depicted  schematically  in  Figure  1(b).  The  updates,  being  constrained  to  be  binary,  are  not 
directly  in  the  direction  of  the  misclassified  pattern;  nevertheless,  the  update  lies  in  the  positive 
half  space  corresponding  to  the  binary  pattern  vector  so  that  the  updated  weight  vector  is  more 
apt  to  classify  the  pattern  correctly. 
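
For comparison, fixed increment Perceptron Training can be sketched in the same style. The sketch assumes reflected patterns (every target $+1$), a cyclic training sequence, and the start $w[1] = 0$; the function name `fixed_increment_perceptron` and the cap `max_updates` are hypothetical.

```python
import numpy as np

def fixed_increment_perceptron(patterns, beta=1.0, max_updates=1_000_000):
    """Fixed increment Perceptron Training on reflected patterns.

    Returns (w, mistakes); w is None if max_updates is reached first.
    """
    U = np.asarray(patterns, dtype=float)
    m, n = U.shape
    w = np.zeros(n)           # w[1] = 0
    mistakes = 0
    streak = 0                # consecutive correct classifications
    t = 0
    while streak < m:
        if mistakes >= max_updates:
            return None, mistakes
        u = U[t % m]
        t += 1
        if u @ w > 0:
            streak += 1
        else:
            w = w + beta * u  # move in the direction of the pattern
            mistakes += 1
            streak = 0
    return w, mistakes
```

Unlike Directed Drift, every update here touches all $n$ components and the weights are free to grow in magnitude.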


Single component Perceptron Training The basic randomisation idea behind single bit update
Directed Drift is easily extended to single component Perceptron Learning, where a single
component of the weight vector is modified at each update epoch (as opposed to traditional
Perceptron Learning where all components are modified at each update epoch).

For each epoch, $t$, let $I[t]$ denote the subset of indices for which the corresponding components
of $w[t]$ and $u[t]$ are opposite in sign:

$$I[t] = \{\, i : u_i[t] \neq \mathrm{sgn}\, w_i[t] \,\}.$$

base: For simplicity, take $w[1] = 0$.

iteration:  The  algorithm's  response  is  predicated,  as  usual,  upon  whether  a  correct 
or  incorrect  response  is  obtained  at  the  current  epoch,  t. 


• If $\langle w[t], u[t] \rangle > 0$, then the weight vector is left unchanged: $w[t+1] = w[t]$.

• If $\langle w[t], u[t] \rangle \le 0$, then an index $i[t]$ is picked at random from the set of indices,
$I[t]$, of mismatched components. The new weight vector is now formed according
to the following rule:

$$w_i[t+1] = \begin{cases} w_i[t] & \text{if } i \neq i[t] \\ w_i[t] + u_i[t] & \text{if } i = i[t]. \end{cases} \qquad (3)$$


In  this  variation,  a  single  bit  is  added  (subtracted)  from  a  randomly  chosen  component  at  each 
update  epoch.  The  mistake  bound  hence  coincides  with  the  number  of  component  updates,  as  in 
single  bit  update  Directed  Drift.  For  fixed  increment  Perceptron  Learning,  of  course,  the  number 
of  component  updates  is  n  times  the  mistake  bound. 
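
The single component variant differs from the fixed increment rule only in the update step. The following sketch makes the same assumptions as before (reflected patterns, cyclic presentation, $w[1] = 0$); the function name `single_component_perceptron` is hypothetical.

```python
import numpy as np

def single_component_perceptron(patterns, rng, max_updates=1_000_000):
    """Randomised single component Perceptron Training on reflected patterns.

    Returns (w, updates); w is None if max_updates is reached first.
    """
    U = np.asarray(patterns, dtype=int)
    m, n = U.shape
    w = np.zeros(n, dtype=int)  # w[1] = 0
    updates = 0
    streak = 0
    t = 0
    while streak < m:
        if updates >= max_updates:
            return None, updates
        u = U[t % m]
        t += 1
        if u @ w > 0:
            streak += 1
            continue
        # Components whose sign disagrees with the pattern (sgn 0 = 0 counts
        # as a mismatch); this set is non-empty whenever <w, u> <= 0.
        mismatched = np.flatnonzero(u != np.sign(w))
        i = rng.choice(mismatched)
        w[i] += u[i]            # add +1 or -1 to a single component
        updates += 1
        streak = 0
    return w, updates
```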


3  RANDOM  WALKS 


An  estimate  of  the  rates  of  convergence  of  the  randomised  algorithms  described  above  may  be 
obtained  by  appealing  to  notions  from  random  walks  and  the  geometric  theory  of  paths. 

Let $\{X_j\}$ be a sequence of Bernoulli trials with success probability $p_n = 1/2 + \beta_n$, $0 < \beta_n < 1/2$,
depending on a parameter $n$:

$$X_j = \begin{cases} 1 & \text{with probability } p_n = 1/2 + \beta_n \\ -1 & \text{with probability } q_n = 1/2 - \beta_n. \end{cases}$$

Let $S_k = \sum_{j=1}^k X_j$ denote a random walk with positive drift, $E S_k = 2k\beta_n$. We are interested
in estimating the expected time of first passage of the random walk to some specified boundary,
$B(n, k)$.


3.1  Fixed  Boundary 

Let us first consider the case of a fixed (one-sided) boundary at $n$. Define

$$T_1(n) = \inf\{k : S_k \ge n\}.$$

For  this  case,  the  theory  of  generating  functions  can  be  readily  invoked  to  estimate  the  expected 
time  of  first  passage  to  the  boundary  (cf.  Feller  [7,  Chapter  III]).  We  have  the  following  estimate: 

Proposition 3.1 $E T_1(n) = n/(2\beta_n)$ for every $n$.

Proof: Let $a_l(k)$ denote the probability that the random walk makes a first passage $l$ units to
the right of the starting point in $k$ steps. We then have $E T_1(n) = \sum_{k=0}^\infty k\, a_n(k)$.

As a first passage through $n$ must necessarily involve a prior first passage through $n - 1$, we
immediately have the convolutional relation

$$a_n(k) = \sum_{j=1}^{k-1} a_{n-1}(j)\, a_1(k - j). \qquad (4)$$

Let $A_l(s)$ denote the generating function for the probability distribution $a_l(k)$; i.e.,

$$A_l(s) = \sum_{k=0}^\infty a_l(k)\, s^k.$$

From equation (4) we hence have

$$A_n(s) = A_{n-1}(s)\, A_1(s) = [A_1(s)]^n, \qquad (5)$$

with  the  latter  equality  following  by  induction. 

To evaluate $A_1(s)$ we need to evaluate the probabilities, $a_1(k)$, of a first transition one unit to
the right in $k$ steps. We note that

$$a_1(0) = 0 \qquad (6)$$

$$a_1(1) = p_n. \qquad (7)$$

Now, a first transition one unit to the right in $k \ge 2$ steps must necessarily involve an initial step
to the left, followed by a first transition one unit to the right (back to the origin), and a final first
transition one more unit to the right. Hence

$$a_1(k) = q_n \sum_{j=1}^{k-2} a_1(j)\, a_1(k - 1 - j), \qquad k = 2, 3, \ldots. \qquad (8)$$

Using equations (6), (7), and (8) we now have

$$\begin{aligned} A_1(s) &= \sum_{k=0}^\infty a_1(k)\, s^k \\ &= p_n s + \sum_{k=2}^\infty a_1(k)\, s^k \\ &= p_n s + q_n s \sum_{k=2}^\infty \sum_{j=1}^{k-2} a_1(j)\, s^j\, a_1(k - 1 - j)\, s^{k-1-j} \\ &= p_n s + q_n s\, [A_1(s)]^2. \end{aligned}$$

Solving for $A_1(s)$ we finally obtain⁵

$$A_1(s) = \frac{1 - \sqrt{1 - 4 p_n q_n s^2}}{2 q_n s}.$$


Substituting in (5) we obtain

$$A_n(s) = \sum_{k=0}^\infty a_n(k)\, s^k = \left[ \frac{1 - \sqrt{1 - 4 p_n q_n s^2}}{2 q_n s} \right]^n.$$

We can now directly compute the expected time of first passage $n$ units to the right by

$$E T_1(n) = A_n'(1) = \frac{n}{p_n - q_n}.$$

The substitution $p_n - q_n = 2\beta_n$ completes the proof.

⁵We discard the positive root as it grows without bound as $s \to 0$, and we require $A_1(0) = 0$.
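
Proposition 3.1 can be checked numerically from the recurrence (8) without using the closed form. The sketch below is illustrative only; the function names and the truncation point `k_max` are assumptions, and the truncated sums are accurate only when the tail of $a_1(k)$ beyond `k_max` is negligible.

```python
def first_passage_probs(p, k_max):
    """a_1(k) for k = 0..k_max from a_1(0) = 0, a_1(1) = p and recurrence (8)."""
    q = 1.0 - p
    a = [0.0] * (k_max + 1)
    if k_max >= 1:
        a[1] = p
    for k in range(2, k_max + 1):
        # Step left (probability q), then two successive first passages
        # of one unit each, taking j and k - 1 - j steps.
        a[k] = q * sum(a[j] * a[k - 1 - j] for j in range(1, k - 1))
    return a

def expected_first_passage(p, n, k_max):
    """Truncated estimate of E T_1(n) = n * sum_k k a_1(k)."""
    a = first_passage_probs(p, k_max)
    mean_one_unit = sum(k * a_k for k, a_k in enumerate(a))
    # A_n(s) = [A_1(s)]^n, so the mean for n units is n times the mean for one.
    return n * mean_one_unit
```

For example, with $p_n = 3/4$ (so $\beta_n = 1/4$) and $n = 10$ the truncated sum can be compared against $n/(2\beta_n) = 20$.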
3.2 Receding Boundary

We now consider the case of a receding (two-sided) boundary at $\pm\sqrt{kn}$. Define

$$T_2(\sqrt{kn}) = \inf\{k : |S_k| > \sqrt{kn}\}.$$

It is difficult to get explicit closed-form expressions when the boundary is not fixed, and we will
be satisfied with an asymptotic estimate for $E T_2(\sqrt{kn})$ as $n \to \infty$. Our development follows
Siegmund [8, Chapter IX] where a general analysis is presented in the context of nonlinear renewal
theory.

Now $T_2(\sqrt{kn})$ (if finite) is the first integer $k$ for which the random variable $S_k^2/k$ exceeds $n$.
Now, simple algebraic manipulations yield


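$$\frac{S_k^2}{k} = \underbrace{4k\beta_n^2 + 4\beta_n (S_k - 2k\beta_n)}_{A_k} + \underbrace{\frac{(S_k - 2k\beta_n)^2}{k}}_{B_k}.$$

By the strong law of large numbers we have

$$\frac{A_k}{k} \to 4\beta_n^2 \text{ with probability one}, \qquad \frac{B_k}{k} \to 0 \text{ with probability one}. \qquad (9)$$

Further, as an easy consequence of Kolmogorov's inequality, we have

$$\frac{1}{k} \max_{1 \le j \le k} B_j \to 0 \text{ in probability}. \qquad (10)$$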


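Proposition 3.2 The following assertions hold:

a) $P\{T_2(\sqrt{kn}) < \infty\} = 1$ for all $n$;

b) $(4\beta_n^2/n)\, T_2(\sqrt{kn}) \to 1$ in probability as $n \to \infty$.

Proof: The observation (9) yields $S_k^2/k^2 \to 4\beta_n^2$ with probability one, so that part (a) of the
proposition follows.

Now let $K_n = n/(4\beta_n^2)$. Fix $0 < \epsilon < 1$ and set $K_n' = K_n(1 + \epsilon)$. From (9) and (10) we have

$$\frac{1}{K_n'} \max_{1 \le k \le K_n'} \frac{S_k^2}{k} \to 4\beta_n^2 \text{ in probability}.$$

Hence, as $n \to \infty$, we have

$$P\{T_2(\sqrt{kn}) > K_n(1 + \epsilon)\} = P\left\{ \max_{1 \le k \le K_n'} \frac{S_k^2}{k} \le n \right\} \to 0.$$

A similar argument shows that $P\{T_2(\sqrt{kn}) < K_n(1 - \epsilon)\} \to 0$. ∎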
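Proposition 3.3 $E T_2(\sqrt{kn}) \sim n/(4\beta_n^2)$ as $n \to \infty$.

Proof: Let $K_n = n/(4\beta_n^2)$, as before. For $k \ge 2K_n$ we have

$$P\{T_2(\sqrt{kn}) > k\} \le P\left\{ \frac{S_k^2}{k} \le n \right\} \le P\{A_k \le n\} \le P\left\{ \sum_{j=1}^k X_j - 2k\beta_n \le -\frac{k\beta_n}{2} \right\}.$$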


The bound above is just the probability of an event in the left tail of the binomial distribution. An
application of Chernoff's bound [9] yields:


$P\{T_2(\sqrt{kn}) > k\} \le$


where 


We  now  claim  that 

$$\sum_{k \ge 2K_n} P\{T_2(\sqrt{kn}) > k\} \to 0 \qquad (n \to \infty). \qquad (11)$$

If $\beta_n$ is bounded away from zero this is clear: $C_n \ge D > 0$ for some absolute positive constant $D$,
and the sum in (11) is just a sum over the exponential tail (note that $K_n \to \infty$ as $n \to \infty$). Now
consider the case where $\beta_n \to 0$. Using the Taylor series approximation

$$\log(1 + x) = x - x^2/2 + O(x^3) \qquad (|x| \to 0)$$

it is easy to see that

$$P\{T_2(\sqrt{kn}) > k\} \le \exp\left( -M \{1 + O(\beta_n)\} \right).$$


It  is  now  readily  verified  that  the  first  term  in  the  series  in  (11)  decreases  exponentially  fast  with 
n.  This  completes  the  proof  of  the  claim. 

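The elementary observation $E\{T_2(\sqrt{kn}) \mid T_2(\sqrt{kn}) > 4K_n\} > 4K_n$ together with the claim now
yields

$$\begin{aligned} E(T_2(\sqrt{kn});\, T_2(\sqrt{kn}) > 4K_n) &\le 2E(T_2(\sqrt{kn}) - 2K_n;\, T_2(\sqrt{kn}) > 4K_n) \\ &\le 2E(T_2(\sqrt{kn}) - 2K_n;\, T_2(\sqrt{kn}) > 2K_n) \\ &\le 2 \sum_{k \ge 2K_n} P\{T_2(\sqrt{kn}) > k\} \to 0 \qquad (n \to \infty). \end{aligned}$$

Hence, the random variables $(4\beta_n^2/n)\, T_2(\sqrt{kn})$, $n \ge 1$, are uniformly integrable. In addition, by
Proposition 3.2(b), $(4\beta_n^2/n)\, T_2(\sqrt{kn}) \to 1$ in probability. The result follows. ∎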
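
The asymptotic estimate of Proposition 3.3 can be explored by direct simulation of the walk. The sketch below is a Monte Carlo illustration under stated assumptions and not part of the proof; the function name `first_passage_receding` and the cap `k_max` are hypothetical.

```python
import random

def first_passage_receding(n, beta, rng, k_max=10**7):
    """First k with |S_k| > sqrt(k n) for a +/-1 walk with P(+1) = 1/2 + beta.

    Returns None if the boundary is not crossed within k_max steps.
    """
    p = 0.5 + beta
    s = 0
    for k in range(1, k_max + 1):
        s += 1 if rng.random() < p else -1
        if s * s > k * n:   # same test as |S_k| > sqrt(k n), in integers
            return k
    return None
```

For instance, averaging the returned times over a few hundred runs with $n = 400$ and $\beta_n = 1/4$ gives a value that can be compared with $n/(4\beta_n^2) = 1600$; the agreement is only expected to be approximate, since the proposition is asymptotic.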


4  ANALYSIS 



4.1  Directed  Drift 

We consider single bit updates for simplicity. Assume that we have a finite set of patterns,
$U = \{u^1, \ldots, u^m\} \subset \mathbb{B}^n$, chosen independently with pattern components drawn from a sequence of
symmetric Bernoulli trials. Let $\{u[t]\}$ be a training sequence, and $\{w[t]\}$ the learning sequence
specified by the rule (2). Let $\{t_k\}$ denote the subsequence of epochs at which patterns from the
training sequence are misclassified; i.e.,

$$\langle w[t_k], u[t_k] \rangle \le 0, \qquad k = 1, 2, \ldots.$$


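We can write the weight vector updates of equation (2) in the form

$$w[t_{k+1}] = w[t_k] + v[t_k],$$

where $v[t_k]$ is a vector whose components satisfy

$$v_j[t_k] = \begin{cases} 0 & \text{if } j \neq j[t_k] \\ 2u_j[t_k] & \text{if } j = j[t_k]. \end{cases}$$

Assume that there is a binary solution vector, $w^* \in \mathbb{B}^n$. Consider the estimate errors

$$\begin{aligned} \mathcal{E}_{k+1} &= \|w[t_{k+1}] - w^*\|^2 \\ &= \|w[t_k] + v[t_k] - w^*\|^2 \\ &= \|w[t_k] - w^*\|^2 + \|v[t_k]\|^2 + 2\langle w[t_k] - w^*, v[t_k] \rangle \\ &= \mathcal{E}_k + 4 + 2\langle w[t_k], v[t_k] \rangle - 2\langle w^*, v[t_k] \rangle. \end{aligned} \qquad (12)$$

Using (2) and (12) we hence obtain

$$\mathcal{E}_{k+1} = \mathcal{E}_k - 4\, w^*_{j[t_k]}\, u_{j[t_k]}[t_k].$$

Define the $\pm 1$ random variables

$$X_k = w^*_{j[t_k]}\, u_{j[t_k]}[t_k].$$

By induction we obtain

$$\mathcal{E}_{k+1} = \mathcal{E}_1 - 4 \sum_{i=1}^k X_i.$$

Upper bounding $\mathcal{E}_1$ by $4n$, and setting $S_k = \sum_{i=1}^k X_i$, we finally obtain

$$0 \le \mathcal{E}_{k+1} \le 4(n - S_k).$$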

The procedure terminates at the value of $k$ for which the random sum $S_k$ first reaches $n$. The
mistake bound, $T$, hence satisfies $S_T \ge n$, and $S_k < n$, for $k = 1, \ldots, T - 1$. The mistake bound is
infinite if there exists no such value of $k$, or if there exists no binary solution vector for the choice
of patterns, $U$.
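
The one-step error recursion behind this argument, namely that flipping a mismatched component $j$ changes the squared error by exactly $-4 w^*_j u_j$, can be verified exhaustively for small $n$. The check below is an illustrative sketch; the function names are hypothetical.

```python
import itertools

def squared_error(w, w_star):
    """Squared Euclidean distance between two weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(w, w_star))

def check_error_recursion(n=4):
    """Verify E_{k+1} = E_k - 4 w*_j u_j over all w*, w, u in {-1, 1}^n
    and every mismatched index j (w_j != u_j)."""
    cube = list(itertools.product((-1, 1), repeat=n))
    for w_star in cube:
        for w in cube:
            for u in cube:
                for j in range(n):
                    if w[j] == u[j]:
                        continue  # only mismatched components are flipped
                    w_new = list(w)
                    w_new[j] = -w_new[j]
                    lhs = squared_error(w_new, w_star)
                    rhs = squared_error(w, w_star) - 4 * w_star[j] * u[j]
                    if lhs != rhs:
                        return False
    return True
```

Each update therefore moves the error by $\pm 4$, which is what makes the comparison with a $\pm 1$ random walk natural.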

The above is reminiscent of a random walk with a fixed boundary at $n$. In fact, if the $m$-set
of patterns $U$ is chosen independently, then the random variables $X_i$ corresponding to different
patterns must be independent. Let us assume that the training sequence is obtained by cycling
through the patterns. For a random initial choice of weight vector a substantial number of patterns
will be misclassified, so that a pattern will recur in the update sequence only after $\Omega(m)$ epochs.
Through the initial progress of the algorithm, hence, the random sum $S_k = \sum_{i=1}^k X_i$ is a sum of
independent, identically distributed, $\pm 1$ random variables with

$$X_i = \begin{cases} 1 & \text{with probability } p_n = 1/2 + \beta_n \\ -1 & \text{with probability } q_n = 1/2 - \beta_n \end{cases}$$

for some $\beta_n \in (0, 1/2]$. The specific value of the probability $p_n$ (the relative frequency of the
number of component matches between a binary solution vector and a pattern) depends on the
size, $m$, of the set of patterns, $U$. We clearly have

$$p_n \ge \frac{n/2 + 1}{n} = \frac{1}{2} + \frac{1}{n},$$

so that $\beta_n \ge 1/n$.

The above argument indicates that the expected time of first passage to $n$ of a one-dimensional
random walk, $S_k$, with positive drift, $2k\beta_n$, can be used as an estimate of the expected mistake
bound for single bit update Directed Drift. Substituting $\beta_n \ge 1/n$ in Proposition 3.1 we then
obtain the estimate $O(n^2)$ for the expected mistake bound when the number of patterns is within
capacity.
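
Spelled out, the substitution is a one-line step that uses only Proposition 3.1 and the bound $\beta_n \ge 1/n$:

```latex
E\,T_1(n) \;=\; \frac{n}{2\beta_n} \;\le\; \frac{n}{2 \cdot (1/n)} \;=\; \frac{n^2}{2} \;=\; O(n^2).
```

For example, $n = 100$ gives an estimate of at most $5000$ updates.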

4.2  Perceptron  Training 

It  is  instructive  to  compare  the  above  convergence  rates  with  rates  that  obtain  for  Perceptron 
Training. The classical proofs of the Perceptron Training procedure only guarantee that the procedure
converges in finite time if a solution exists: convergence time, however, is strongly dependent
on  the  distribution  of  (real)  patterns,  and  in  the  worst  case  can  be  exponential  in  the  number 
of  bits  needed  to  specify  the  pattern  distribution.  When  constraints  are  placed  on  the  allowable 
choices  of  patterns,  however,  convergence  can  be  much  more  rapid.  To  compare  mistake  bounds 
with  Directed  Drift,  let  us  consider  Perceptron  Training  when  the  patterns  are  binary,  and  under 
the  condition  that  there  exists  a  binary  solution  vector.  (Note,  however,  that  we  only  require  that 
the  Perceptron  Training  Algorithm  return  a  real  solution  vector.)  We  show  first  that  the  fixed 
increment  Perceptron  Training  Algorithm  converges  in  the  worst  case  with  a  mistake  bound  which 
increases  no  faster  than  quadratically  in  n;  alternatively,  the  total  number  of  component  updates 
before convergence is $O(n^3)$. We follow this with an average case analysis, similar in flavour to the
analysis  for  Directed  Drift,  for  randomised,  single  component  update  Perceptron  Training,  which 
yields  similar  results. 

Let $U = \{u^1, \ldots, u^m\} \subset \mathbb{B}^n$ be a finite set of patterns, $\{u[t]\}$ the training sequence, and $\{w[t]\}$
the learning sequence. As before, let $\{t_k\}$ denote the subsequence of epochs at which patterns from
the training sequence are misclassified.


⁶To facilitate ease of analysis for the nonce we assume that the number of patterns is within the computational
capacity of a linear threshold element with binary weights. (The capacity is quite large, linear in $n$, and capacities
of the order of $n/\log n$ can be easily achieved for rather simple algorithms [4, 5].) For a random choice of patterns
we are then assured with arbitrarily high probability asymptotically that there exists a binary solution vector.
Fixed increment training The weight vector updates result in

$$w[t_{k+1}] = w[t_k] + \beta u[t_k], \qquad k = 1, 2, \ldots.$$

Assume that there is a binary solution vector, $w^* \in \mathbb{B}^n$. For a choice of parameter $c > 0$ to be
specified shortly, consider now the estimate errors

$$\begin{aligned} \mathcal{T}_{k+1} &= \|w[t_{k+1}] - c w^*\|^2 \\ &= \|w[t_k] + \beta u[t_k] - c w^*\|^2 \\ &= \mathcal{T}_k + \beta^2 \|u[t_k]\|^2 + 2\beta \langle w[t_k] - c w^*, u[t_k] \rangle. \end{aligned}$$

Since $w^*$ is a binary solution vector we must have that $\langle w^*, u[t] \rangle \ge 1$ for each pattern in the
training sequence. Furthermore, $\|u[t]\|^2 = n$ for each pattern in the training sequence, and as $w[t_k]$
misclassifies pattern $u[t_k]$ by definition, we also have $\langle w[t_k], u[t_k] \rangle \le 0$. Hence

$$0 \le \mathcal{T}_{k+1} \le \mathcal{T}_k + \beta^2 n - 2c\beta.$$

With the assumption that $w[1] = 0$ we have $\mathcal{T}_1 = \|w[1] - c w^*\|^2 = c^2 n$. By induction on the above
inequality we then obtain

$$0 \le \mathcal{T}_{k+1} \le c^2 n - \beta(2c - \beta n) k.$$

The procedure terminates with a mistake bound

$$T \le \frac{c^2 n}{\beta(2c - \beta n)}.$$

Choosing $c = \beta n$ minimises this upper bound for the mistake bound. With this choice of $c$ we
obtain that the mistake bound for the fixed increment Perceptron Training Rule is $T \le n^2$ under
the  constraints  of  a  binary  pattern  space,  and  with  the  requirement  that  there  exist  a  binary 
solution  vector.  Note  that  this  bound  is  independent  of  the  number  of  binary  patterns,  and  their 
distribution.  The  sole  requirement  for  this  estimate  of  convergence  time  for  the  Perceptron  Training 
Rule to hold is that there exist a vertex of the $n$-cube (a binary solution vector) within the convex
polyhedral cone defined by the space of real solution vectors. While the mistake bound gives the
number of weight updates before convergence, it must also be noted that in the Perceptron Training
Rule each update is a synchronous update in which each of the $n$ components of the weight vector
is modified (as opposed to the single bit modifications in the simpler version of the Directed Drift
Algorithm) and each component modification requires the addition of a real scalar. Thus, in the
worst case, the procedure terminates after no more than $O(n^3)$ component updates.
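
The choice $c = \beta n$ and the resulting bounds follow from a short calculation, written out here as a worked step (valid for $c > \beta n/2$, where the bound is positive; the name $g$ is introduced only for this illustration):

```latex
g(c) = \frac{c^2 n}{\beta(2c - \beta n)}, \qquad
g'(c) = \frac{2cn\,(c - \beta n)}{\beta(2c - \beta n)^2} = 0
\;\Longrightarrow\; c = \beta n, \qquad
g(\beta n) = \frac{\beta^2 n^3}{\beta \cdot \beta n} = n^2 .
```

Since each of the at most $n^2$ updates modifies all $n$ components, the number of component updates is at most $n \cdot n^2 = n^3$.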


Single  component  updates  The  weight  vector  updates  of  equation  (3)  can  be  written  in  the 
form 

$$w[t_{k+1}] = w[t_k] + x[t_k]$$

where $x[t_k]$ denotes the vector with components

$$x_i[t_k] = \begin{cases} 0 & \text{if } i \neq i[t_k] \\ u_i[t_k] & \text{if } i = i[t_k]. \end{cases}$$

Let $w^* \in \mathbb{B}^n$ be a binary solution vector. We have the following bounds on the length-square of
the weight vector estimates:

$$\begin{aligned} \mathcal{L}_{k+1} = \|w[t_{k+1}]\|^2 &= \|w[t_k] + x[t_k]\|^2 \\ &= \mathcal{L}_k + \|x[t_k]\|^2 + 2\, w_{i[t_k]}[t_k]\, u_{i[t_k]}[t_k] \\ &\le \mathcal{L}_k + 1. \end{aligned}$$

We hence have $\mathcal{L}_{k+1} \le k$ as a consequence of the choice $w[1] = 0$. Also, as a consequence of the
Cauchy-Schwarz inequality, we have


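$$\begin{aligned} \mathcal{L}_{k+1} &\ge \frac{|\langle w[t_{k+1}], w^* \rangle|^2}{\|w^*\|^2} = \frac{1}{n} \left| \langle w[t_k], w^* \rangle + \langle w^*, x[t_k] \rangle \right|^2 \\ &= \frac{1}{n} \left| \sum_{h=1}^k \langle w^*, x[t_h] \rangle \right|^2 = \frac{1}{n} \left| \sum_{h=1}^k w^*_{i[t_h]}\, u_{i[t_h]}[t_h] \right|^2. \end{aligned}$$

Define the $\pm 1$ random variables $X_h = w^*_{i[t_h]}\, u_{i[t_h]}[t_h]$, and let $S_k = \sum_{h=1}^k X_h$. We then have

$$\sqrt{k} \ge \sqrt{\mathcal{L}_{k+1}} \ge \frac{|S_k|}{\sqrt{n}}.$$

The algorithm terminates at the first instant $k$ for which $|S_k|$ exceeds $\sqrt{kn}$. Arguing as for Directed
Drift, we use as an estimate for the expected mistake bound the expected time of first passage of a
random walk with positive drift $2k\beta_n \ge 2k/n$ to the two-sided boundary at $\pm\sqrt{kn}$: Proposition 3.3
then yields the asymptotic estimate $O(n^3)$ for the expected mistake bound as $n \to \infty$.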
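
The arithmetic behind this estimate is the same substitution as before, now applied to Proposition 3.3 with $\beta_n \ge 1/n$:

```latex
E\,T_2(\sqrt{kn}) \;\sim\; \frac{n}{4\beta_n^2} \;\le\; \frac{n}{4 \cdot (1/n)^2} \;=\; \frac{n^3}{4} \;=\; O(n^3).
```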


5  SIMULATIONS 

Computer  simulations  indicate  that  the  rapid  convergence  times  predicted  by  analysis  hold  when 
the  number  of  patterns  to  be  loaded  lies  within  the  capacity.  Mistake  bounds  are  plotted  as  a 
function  of  n  in  Figure  2.  In  each  plot  mistake  bounds  for  each  choice  of  m  and  n  were  averaged 
over  1000  runs  of  the  single  bit  update  Directed  Drift  Algorithm.  In  each  run  of  the  algorithm  an 
independent  set  of  patterns  was  drawn  from  a  standard  pseudo-random  binomial  number  generator. 
(To  ensure  the  existence  of  a  binary  solution  weight  vector,  a  binary  n-tuple  was  selected  at  random 
as  the  solution  vector,  and  those  patterns  lying  in  the  negative  half  space  of  the  solution  vector 
were  reflected.)  A  random  initial  binary  weight  vector  was  selected  as  the  initial  estimate  of  the 
weight  vector  presented  to  the  single  bit  update  Directed  Drift  Algorithm  with  the  training  sequence 
obtained by cyclically presenting the patterns. At convergence the number of adaptations of the
weight vector was stored as the estimate of the mistake bound. The expected mistake bound for
the  choice  of  m  and  n  was  evaluated  by  averaging  the  number  of  adaptations  before  convergence 
over  1000  independent  runs  (each  on  an  independently  chosen  data  set). 
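
The data-generation step described above can be sketched as follows. This is a minimal illustration, not the original simulation code: the function name `make_separable_patterns`, the seed, and the choice n = 21 are assumptions made for the example. Taking n odd guarantees that no inner product of two ±1 vectors is zero.

```python
import random

def make_separable_patterns(n, m, rng):
    """Draw m random +/-1 patterns that a planted binary vector classifies as +1."""
    # Planted binary solution vector, chosen uniformly from {-1, +1}^n.
    w_star = [rng.choice((-1, 1)) for _ in range(n)]
    patterns = []
    for _ in range(m):
        u = [rng.choice((-1, 1)) for _ in range(n)]
        # Reflect any pattern lying in the negative half space of w_star.
        if sum(wi * ui for wi, ui in zip(w_star, u)) < 0:
            u = [-ui for ui in u]
        patterns.append(u)
    return w_star, patterns

rng = random.Random(0)  # fixed seed, used only to make the sketch reproducible
w_star, patterns = make_separable_patterns(n=21, m=10, rng=rng)
margins = [sum(wi * ui for wi, ui in zip(w_star, u)) for u in patterns]
```

After reflection every pattern lies in the positive half space of the planted vector, so a binary solution is guaranteed to exist for each generated data set.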

In  Figures  2(a)  and  2(b)  the  number  of  patterns,  m,  was  fixed  within  capacity  (at  m  =  n/4 
and  m  =  n/2,  respectively)  and  very  rapid  convergence  times  are  seen.  Expected  mistake  bounds 
increase  significantly  around  capacity  as  illustrated  in  Figure  2(c)  with  m  =  n.  Mistake  bounds 
finally saturate above capacity at around 2n (plotted in Figure 2(d)). When the number of patterns
is  well  above  capacity  it  is  not  clear  whether  the  algorithm  still  converges  polynomially  fast,  and 
this  is  under  investigation.  Convergence  is  still,  however,  several  orders  of  magnitude  faster  than 
an  exhaustive  search  through  the  vertices  of  the  n-cube.  (For  instance,  when  n  =  20  and  m  =  40, 
exhaustive  search  requires  of  the  order  of  10®  steps  while  Directed  Drift  converges  in  about  10^ 
steps.)  Early  results  indicate  that  order  of  magnitude  improvements  in  mistake  bound  may  be 
obtained  by  updating  several  bits  at  a  time  with  an  appropriate  choice  of  cooling  schedule  in 
the  algorithm.  We  do  not  have  good  heuristics  at  this  time,  however,  for  choice  of  good  cooling 
schedules. 

Figures  3  and  4  show  a  plot  of  the  expected  mistake  bound,  T,  for  single  bit  update  Directed 
Drift  as  the  number  of  patterns,  m,  increases  for  a  fixed  value  of  n.  We  observe  that  the  expected 
mistake  bound  saturates  to  a  fixed  value  depending  on  n  when  the  number  of  patterns  exceeds 
approximately  3n  in  the  range  of  n  we  considered  in  our  simulations.  Note  also  the  rather  abrupt 
threshold  behaviour  when  the  number  of  patterns  exceeds  the  capacity  (n).  We  conjecture  that 
there  is  a  threshold  function  for  the  expected  mistake  bound  around  the  capacity. 

The  saturation  of  the  mistake  bound  when  the  number  of  patterns  exceeds  capacity  is  caused 
essentially  by  the  shrinkage  in  the  solution  space — when  the  number  of  patterns  exceeds  roughly  3n, 
then  the  binary  solution  vector,  if  one  exists,  is  essentially  unique.  We  illustrate  this  in  Figure  4: 
for  a  fixed  value  of  n  we  plot  simultaneously  the  expected  mistake  bound  and  the  relative  frequency 
with  which  the  algorithm  terminates  in  an  initially  chosen  binary  solution  vector.  The  saturation 
in  mistake  bound  around  3n  is  again  evident,  as  well  as  the  threshold  behaviour  around  capacity. 
Note  that  the  probability  that  there  are  multiple  solution  vectors  is  the  dual  of  the  mistake  bound 
curve;  while  for  a  small  number  of  patterns  there  exist  many  binary  solution  vectors,  a  precipitous 
drop  in  the  probability  of  multiple  solutions  is  evidenced  around  the  capacity,  and  finally  around 
3n  there  exists  only  one  solution  vector  with  high  probability. 
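
The shrinkage of the solution space can be observed directly for small n by brute-force enumeration. The sketch below is illustrative only: `count_binary_solutions`, the seed, and n = 9 are assumptions for the example, and the enumeration over all 2^n vertices is feasible only for small n.

```python
import itertools
import random

def count_binary_solutions(patterns, n):
    """Count w in {-1,+1}^n with <w, u> > 0 for every pattern u (brute force)."""
    count = 0
    for w in itertools.product((-1, 1), repeat=n):
        if all(sum(wi * ui for wi, ui in zip(w, u)) > 0 for u in patterns):
            count += 1
    return count

rng = random.Random(1)
n = 9  # odd, so inner products of +/-1 vectors are never zero
w_star = [rng.choice((-1, 1)) for _ in range(n)]
patterns = []
counts = []  # counts[k-1] = number of binary solutions for the first k patterns
for _ in range(3 * n):
    u = [rng.choice((-1, 1)) for _ in range(n)]
    if sum(a * b for a, b in zip(w_star, u)) < 0:
        u = [-x for x in u]  # reflect so that the planted vector stays a solution
    patterns.append(u)
    counts.append(count_binary_solutions(patterns, n))
```

Each added pattern can only remove candidate vectors, so `counts` is non-increasing and never drops below 1, since the planted vector always remains a solution.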

This  observation  has  an  important  consequence  from  the  point  of  view  of  generalisation  in 
learning.  If  the  observed  saturation  of  the  expected  stopping  time  around  3n  patterns  extends 
uniformly  for  all  n,  then  any  linearly  separable  Boolean  function  for  which  there  exists  a  binary 
solution vector can be learnt with no more than 3n examples (of the total of $2^n$ instances) of the
function drawn at random. In ongoing work we are attempting to make this rigorous.

In  Figure  5  we  plot  the  average  number  of  component  updates  before  convergence  versus  the 
number  of  patterns  (with  n  =  10  fixed)  for  fixed  increment  Perceptron  Training.  [The  expected 
mistake  bound  is  an  order  of  magnitude  smaller;  the  average  number  of  component  updates  is 
n  times  the  mistake  bound  for  fixed  increment  Perceptron  Training.  For  single  update  Directed 
Drift,  as  noted  before,  the  mistake  bound  coincides  with  the  number  of  component  updates.]  The 
sharp  threshold  behaviour  seen  in  Directed  Drift  is  not  so  much  in  evidence  here.  Saturation  again 
appears  to  be  around  twice  the  capacity  (2n  for  real  weights).  Note  that  the  average  mistake  bound 
is  an  order  of  magnitude  lower  than  the  worst-case  upper  bound  0{n^).  The  derived  worst-case 
bound  may,  hence,  be  too  conservative.  On  the  same  figure  we  also  plot  the  normalised  length, 
$L/\sqrt{n}$, of the solution vector returned by the algorithm. A similar saturation phenomenon is in
evidence, with the length of the solution vector saturating at a value somewhat larger than the
length, $\sqrt{n}$, of the binary solution vector.
6 CONCLUSIONS

While  the  general  problem  of  learning  binary  weights  is  NP-complete,  the  rapid  convergence  of  the 
Directed  Drift  Algorithms  indicates  that  the  typical  problem  may  well  be  tractable  even  if  there 
exist,  perhaps  pathological,  intractable  bad  instances.  The  simplicity  of  these  probabilistic  (binary) 
learning  algorithms  allows  of  several  possible  extensions  to  networks  of  neurons — in  particular, 
feedforward  structures.  This  is  clearly  of  some  theoretical  and  practical  import,  and  is  under 
investigation. 
Figure Captions

Fig. 1 (a) A schematic depiction of the incremental change in the current weight vector, w[t], made
by the Perceptron Training Rule. The change is in the direction of the misclassified pattern,
u[t], and the resultant weight vector, w[t + 1], is more apt to classify the pattern correctly.
(b) The corresponding scenario for Directed Drift. Here the algorithm is constrained to make
only single bit changes in the current (binary) weight vector. Bit changes in the positive
hemisphere (in the direction of the misclassified pattern) improve the possibility of correct
classification.

Fig.  2  Plots  of  the  expected  mistake  bound  (averaged  over  1000  trials)  of  the  Directed  Drift 
Algorithm  as  a  function  of  n  for  four  cases  with  the  number  of  patterns,  m,  chosen  equal  to 
n/4,  n/2,  n,  and  2n.  The  first  two  cases  are  within  capacity,  and  the  curves  reflect  best  fit 
quadratics.  For  m  =  n  and  m  =  2n  the  number  of  patterns  exceed  capacity,  and  the  curves 
fitted  are  polynomials  of  degree  4  and  5,  respectively.  The  best  fit  exponential  curve,  2"", 
above  capacity  has  an  exponent  of  roughly  0.75n  in  the  range  considered. 

Fig.  3  The  expected  mistake  bound  (averaged  over  1000  independent  trials)  of  Directed  Drift  is 
plotted  against  the  number  of  patterns  for  n  =  20.  A  threshold  phenomenon  is  observed 
around  capacity  when  the  mistake  bound  rises  abruptly.  The  mistake  bound  saturates  to  a 
fixed  value  when  the  number  of  patterns  exceeds  approximately  3n. 

Fig. 4 The expected mistake bound and the probability that Directed Drift terminates in a
solution vector different from the one specified (at random) initially are plotted as a function
of the number of patterns for n = 10. (Results are averaged over 100 independent trials.)
Note the same threshold behaviour around capacity and the saturation phenomenon for the
mistake bound as observed in Figure 3. The probability that there is more than one binary
solution is a dual of the curve for the mistake bound: the probability of many binary solution
vectors is high when the number of patterns is small, and plunges abruptly around capacity
to essentially zero around 3n. The saturation of the mistake bound around 3n patterns thus
seems to be a consequence of the reduction in the binary solution space until there is only a
unique binary solution vector around 3n patterns. Specifying more exemplar patterns then
does not yield any further information on the solution vector.

Fig. 5 The expected mistake bound and the length of the solution vector (normalised by $\sqrt{n} =
\sqrt{10}$) at convergence are plotted against the number of patterns for n = 10 for fixed
increment Perceptron Training. (Results were averaged over 100 independent trials.) The
averaged mistake bound saturates around 6n patterns, reflecting the larger capacity, while
the normalised length of the solution vector returned by the algorithm saturates at a value
somewhat larger (about a factor of 2.5) than the length of a binary solution vector.


[Figure 2: Mistake Bound plotted against n.]

[Figure 3 (n = 20): Mistake Bound plotted against Number of Patterns.]

[Figure 4 (n = 10): Average Mistake Bound, T, and Multiple Solution Probability, p, plotted against Number of Patterns, m.]

[Figure 5 (n = 10): Component Updates, n*T, and Normalised Length, L*n^(-1/2), plotted against Number of Patterns, m.]


To  appear: 

Proceedings  of  the  4th  Workshop 
on Computational Learning Theory
(L.G.  Valiant  &  M.K.  Warmuth,  eds) 

Morgan  Kaufmann,  San  Mateo,  CA 
1991 

On  Learning  Binary  Weights  for  Majority 

Functions 


Santosh  S.  Venkatesh 
Department  of  Electrical  Engineering 
University  of  Pennsylvania 
Philadelphia, PA 19104
Email: venkatesh@ee.upenn.edu


Abstract 

We investigate algorithms for learning binary
weights from examples of majority functions
of a set of literals. In particular, given a
set of (randomly drawn) input-output pairs,
with inputs being binary ±1 vectors, and the
outputs likewise being ±1 classifications, we
seek to find a vector of binary (±1) weights
for a linear threshold element (or formal neuron)
which provides a linearly separable hypothesis
consistent on the set of examples.
We present three algorithms (Directed Drift,
Harmonic Update, and Majority Rule) for
learning binary weights in this context, and
examine their characteristics. In particular,
we formally define a distribution dependent
notion of algorithmic capacity (which is related
to the distribution free notion of the VC
dimension) and provide estimates of the capacity
of the proposed algorithms.


1  INTRODUCTION 

Recent results have indicated that large dynamic
ranges may not be needed for the weights in neural
networks [VF91]. In particular, for many applications,
binary weights may suffice; alternatively,
a network with real interconnection weights can
be replaced by an equivalent network of binary weights
realising the same Boolean function with a slight
increase in the size of the network. Concomitant with
the birth of a theory validating the computational
capabilities of networks with binary (or limited dynamic
range) weights, there has been a development of a
capability to produce large hardware implementations of
such networks [GH90].

A question of some practical import is whether there
are algorithms which can successfully exploit the latent
information storage capabilities of these networks by
learning binary weights for a given architecture from
instances of the function to be represented. Unfortunately,
theory may proscribe a general solution: the
problem of learning binary weights is NP-complete.
The issue is, however, open whether there are (randomised)
learning algorithms which converge rapidly
on average. As a first step in the consideration of
this problem, we consider learning binary weights in
the context of linearly separable functions; i.e., restrict
consideration to learning binary weights for a single
neuron.^1

Given an arbitrary linearly separable dichotomy of a
finite set of patterns, the Perceptron Training Algorithm
[Ros62] guarantees convergence in finite time of
an iteratively updated sequence of weight vectors to
a real solution vector which separates the dichotomy.
Perceptron Training is an on-line procedure which has
good average-case convergence times, but which can
occasionally exhibit a worst-case exponential time
convergence. Worst-case polynomial running times can,
however, be guaranteed for the problem with off-line
procedures such as Karmarkar's algorithm for linear
programming. Learning binary weights for a neuron
is, however, equivalent to integer programming, which
is known to be NP-complete [GJ79].^2

We present three approaches to the problem of learning
binary weights for linear threshold functions, the
target functions being majority functions of a set of n
literals. In the first approach we present a randomised,
local, and homogeneous on-line procedure, which we
call Harmonic Update [Ven91(a)], for learning binary
weights from a single pass of a set of examples. The
second algorithm we present is a homogeneous off-line
procedure we call Majority Rule [Ven91(a)]. In
the third approach we develop a family of randomised
algorithms, dubbed Directed Drift [Ven91(b)], which
are on-line, local, and mistake driven.

A key parameter we estimate is the capacity, an algorithm
and distribution dependent parameter linked
to the VC dimension. The main results here are
that Harmonic Update has a capacity of the order of
$\sqrt{n}/\sqrt{\log n}$, Majority Rule has a capacity of the order
of $n/\log n$, and Directed Drift has a capacity of order
between $n/\log n$ and $n$. Furthermore, these capacities
are maximal among algorithms with the respective features
of these three algorithms.

^1 In this discussion, we shall use the term "neuron" synonymously
with a linear threshold element.

^2 Some problems are born to NP-completeness, some
attain NP-completeness, and others have NP-completeness
thrust upon 'em.

The  Harmonic  Update  Algorithm,  while  on-line,  is  not 
mistake  driven  and  terminates  after  a  single  pass  of  the 
set  of  examples.  Mistake  bounds  or  convergence  times 
for  the  Directed  Drift  Algorithm  are,  however,  much 
harder  to  obtain.  A  feeling  for  the  problem  can  be 
obtained,  however,  by  appealing  to  analogous  situa¬ 
tions  in  the  theory  of  random  walks  and  the  geomet¬ 
ric  theory  of  paths.  The  corresponding  problem  here 
involves  the  estimation  of  the  expected  time  of  first 
passage  of  a  random  walk  with  positive  drift  to  a  fixed 
boundary  at  n.  We  obtain  the  estimate  0(n?)  for  the 
expected  time  of  first  passage  to  the  boundary  and  ar¬ 
gue  heuristically  that  this  may  hold  as  an  estimate  for 
the  expected  mistake  bound  for  Directed  Drift  when 
the  number  of  examples  is  within  the  capacity  of  the 
algorithm.  In  an  appendix  we  also  provide  a  compar¬ 
ison  of  the  rate  of  convergence  of  Directed  Drift  with 
Perceptron  Training:  we  show  that  the  corresponding 
worst-case  and  average-case  number  of  component  up¬ 
dates  for  Perceptron  Training  is  0{n^). 

On notation: We will use the symbol $\mathbb{B}$ to denote the
set $\{-1, 1\}$. If $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$
are points in real Euclidean n-space, we denote by
$\langle x, y\rangle$ the inner product $\sum_{j=1}^{n} x_j y_j$. We use the word
epoch to denote points on the time axis. A physical
weight update may take some time, but we will assume
updates are timeless and occur at epochs.^3 We
define the function $\mathrm{sgn} : \mathbb{R} \to \mathbb{B}$ by $\mathrm{sgn}\,x = x/|x|$ if
$x \ne 0$ and $\mathrm{sgn}\,0 = 1$. All logarithms in the exposition
are to base e. If $\{x_n\}$ and $\{y_n\}$ are positive sequences,
we denote: $x_n \lesssim y_n$ if $x_n \le y_n$ for n large enough;
$x_n \gtrsim y_n$ if $x_n \ge y_n$ for n large enough.

2  THE  SETTING 

We are given a set of patterns, $\mathcal{U} \subset \mathbb{B}^n$, and a function
$f : \mathcal{U} \to \mathbb{B}$ which is linearly separable: specifically,
there exists a (binary) solution weight vector, $w^* \in \mathbb{B}^n$,
such that

$$\mathrm{sgn}\,\langle w^*, u\rangle = f(u) \qquad (1)$$

for every choice of pattern $u \in \mathcal{U}$. We call the function
f the target function; these are, hence, majority
functions of a set of literals. Given $\mathcal{U}$ and a linearly
separable target function f, the goal is to efficiently
find a (binary) solution weight vector $w \in \mathbb{B}^n$. Note
that f dichotomises the set of patterns $\mathcal{U}$. Without
loss of generality we assume that f(u) = 1 for every
pattern $u \in \mathcal{U}$.^4

^3 In his text, W. Feller [Fel68, page 73] credits J. Riordan
with initiating the usage of the word epoch in such
situations.

^4 For the nonce, extend f to the domain $\mathbb{B}^n$ using the
relation (1). If f(u) = −1 then f(−u) = 1, as can be easily
seen from (1). Replacing each pattern in $\mathcal{U}$ for which
f(u) = −1 by −u we obtain a corresponding set of patterns
$\mathcal{U}'$; if $w^*$ is any solution weight vector separating the
dichotomy of $\mathcal{U}$ specified by f then all patterns in $\mathcal{U}'$ lie on
the same (positive) side of the hyperplane corresponding
to $w^*$, and conversely.


2.1 ALGORITHMS

In an off-line binary learning algorithm, weights $w_i \in \mathbb{B}$,
i = 1, ..., n, are produced directly as a function of
the set of patterns $\mathcal{U}$: more specifically, $w_i = g_i(\mathcal{U})$,
where the functions $g_i : \mathcal{U} \to \mathbb{B}$ are specified by the
algorithm. We say that: an off-line algorithm is local if
$w_i$ is determined solely from the ith component of the
patterns, $w_i = g_i(\{u_i : u \in \mathcal{U}\})$, for each i = 1, ...,
n;^5 a local off-line algorithm is homogeneous if there is
a function g such that $w_i = g(\{u_i : u \in \mathcal{U}\})$, for each
i = 1, ..., n.

^5 We abuse notation somewhat here by retaining the
same functional notation $g_i$ over different domains.

In contrast, an on-line algorithm for learning from examples
is a procedure where learning takes place in a
sequence of trials. The protocol is as follows:

1° At epoch t the system is characterised by a weight
vector, $w[t] \in \mathbb{B}^n$, and receives an example pattern,
$u[t] \in \mathbb{B}^n$, drawn from $\mathcal{U}$.

2° The system produces a response $v[t] \in \mathbb{B}$ according
to the sign of $\langle w[t], u[t]\rangle$.

3° A new weight vector, $w[t+1] \in \mathbb{B}^n$, is generated
based on the current response $v[t] \in \mathbb{B}$, weight
vector $w[t] \in \mathbb{B}^n$, and example $u[t] \in \mathbb{B}^n$.

The procedure is carried out iteratively, and is terminated
if a solution weight vector is obtained. Note
that we restrict ourselves to on-line algorithms which
generate binary weight vectors, $w[t] \in \mathbb{B}^n$, at each
epoch of the learning process; specifically, the weights
are confined to the domain {−1, 1} throughout learning.
This situation may be compared to Perceptron
Learning where the weights typically grow in magnitude
during the learning process. We call the sequence
of examples, $\{u[t]\}_{t=1}^{\infty}$, the training sequence, and the
sequence of weight vectors, $\{w[t]\}_{t=1}^{\infty}$, the learning sequence.
If the procedure terminates in a finite time,
we say that the learning algorithm has learnt the function
f. We will be interested in the mistake bound T,
the number of classification mistakes the learning algorithm
makes on the training sequence before it learns
the given function. In particular, the mistake bound
is equal to the number of epochs for which $\langle w[t], u[t]\rangle$
is not positive. For our purposes, the mistake bound
is equal to the actual number of updates of the weight
vector before the function is learnt.

We say that an on-line learning algorithm is local if
each weight, $w_i[t+1] \in \mathbb{B}$, is updated solely as a
function of $w_i[t]$, $u_i[t]$, and $v[t]$, i.e., for each i = 1,
..., n, there is a sequence of (possibly probabilistic)
functions $f_{i,t}$ such that

$$w_i[t+1] = f_{i,t}(w_i[t], u_i[t], v[t]), \qquad i = 1, \ldots, n.$$

In analogy with the corresponding situation for off-line
algorithms, we say that a local on-line algorithm
is homogeneous if all weights at a given epoch have the
same update rule, i.e., there is a sequence of (possibly
probabilistic) functions $f_t$ such that

$$w_i[t+1] = f_t(w_i[t], u_i[t], v[t]), \qquad i = 1, \ldots, n.$$

In addition, we say that an on-line algorithm is single-pass
if each pattern $u \in \mathcal{U}$ occurs exactly once in the
training sequence.
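
The three-step on-line protocol and the counting of mistakes can be sketched as a short loop. This is an illustration under stated assumptions: the cyclic presentation order, the `max_epochs` cut-off, and in particular the single-bit update rule built by `make_single_bit_rule` are hypothetical choices made for the example and are not claimed to reproduce any of the algorithms analysed in this paper.

```python
import random

def inner(w, u):
    """Inner product <w, u> of two +/-1 lists."""
    return sum(wi * ui for wi, ui in zip(w, u))

def sgn(x):
    """Sign with the convention sgn 0 = 1 used in the text."""
    return 1 if x >= 0 else -1

def is_solution(w, patterns):
    return all(inner(w, u) > 0 for u in patterns)

def run_online(patterns, w, update, max_epochs):
    """Cyclic on-line loop; returns (weights, mistakes, learnt flag)."""
    mistakes = 0
    for t in range(1, max_epochs + 1):
        if is_solution(w, patterns):
            return w, mistakes, True
        u = patterns[(t - 1) % len(patterns)]   # step 1: example pattern u[t]
        v = sgn(inner(w, u))                    # step 2: response v[t]
        if inner(w, u) <= 0:                    # mistake: <w[t], u[t]> not positive
            mistakes += 1
        w = update(w, u, v)                     # step 3: new binary vector w[t+1]
    return w, mistakes, is_solution(w, patterns)

def make_single_bit_rule(rng):
    """Hypothetical rule: on a mistake, flip one randomly chosen bit with w_i != u_i."""
    def update(w, u, v):
        if inner(w, u) > 0:
            return w
        disagree = [i for i in range(len(w)) if w[i] != u[i]]
        i = rng.choice(disagree)
        return w[:i] + [u[i]] + w[i + 1:]
    return update

rule = make_single_bit_rule(random.Random(0))
w_final, mistakes, learnt = run_online([[1, 1, 1]], [-1, -1, -1], rule, max_epochs=10)
```

In this toy run two single-bit flips are needed before the inner product with the lone pattern becomes positive, so the mistake count equals the number of weight updates, as noted in the text.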

2.2  CAPACITY 

For positive integers m and n, let $\mathcal{U}_n^m = \{u^1, \ldots, u^m\}$
be a random m-set of patterns chosen independently
from $\mathbb{B}^n$; specifically, for each α = 1, ..., m, the
patterns $u^\alpha = (u_1^\alpha, \ldots, u_n^\alpha)$ are chosen independently,
with the components $u_i^\alpha$ of each pattern drawn from a
sequence of symmetric Bernoulli trials

$$P\{u_i^\alpha = +1\} = P\{u_i^\alpha = -1\} = 1/2.$$

Let A be an algorithm for learning binary weights,
and let $P_A(n, m)$ be the probability that A produces
a solution weight vector for the m-set of patterns $\mathcal{U}_n^m$;
i.e., $P_A(n, m)$ is the probability that, given the m-set
of patterns $\mathcal{U}_n^m$, the algorithm A yields a weight vector
$w_A(\mathcal{U}_n^m) \in \mathbb{B}^n$ such that $\langle w_A(\mathcal{U}_n^m), u\rangle > 0$ for every
$u \in \mathcal{U}_n^m$.

Definition 2.1 We say that a sequence $C_n$ is a capacity
function (or simply, capacity) for A if, for any
choice of $0 < \lambda < 1$, the following two properties
hold:

a) $P_A(n, m_n) \to 1$ as $n \to \infty$ for every sequence
$\{m_n\}$ which is such that $m_n \lesssim (1 - \lambda) C_n$;

b) $P_A(n, m_n) \to 0$ as $n \to \infty$ for every sequence
$\{m_n\}$ which is such that $m_n \gtrsim (1 + \lambda) C_n$.

We  also  say  that  C„  is  a  lower  capacity  if  property  (a) 
holds,  and  that  Cn  is  an  upper  capacity  if  property  (b) 
holds. 

This  notion  of  the  capacity  function  has  a  counter¬ 
part  in  the  theory  of  random  graphs  in  the  notion  of 
a  threshold  function  of  an  attribute.  Loosely  speaking, 
the  capacity  function  quantifies  the  capability  of  the 
algorithm  under  consideration  by  specifying  the  size  of 
the  “largest  typical  set”  of  patterns  for  which  “most” 
dichotomies  are  separated  by  the  algorithm  with  high 
probability.  Note  that  the  capacity  function  depends 
implicitly  upon  the  choice  of  distribution  for  the  pat¬ 
terns.  We  could  allow  other  distributions,  or  more 
generally,  a  family  of  distributions. 
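
For a concrete feel for the quantity $P_A(n, m)$, it can be estimated by simulation for any algorithm that maps a pattern set to a binary weight vector. The sketch below is a hedged illustration: `estimate_success_probability` is a hypothetical helper, and `componentwise_majority` is a simple stand-in rule chosen for the example (it is an assumption, not a rule taken from the text).

```python
import random

def componentwise_majority(patterns):
    """Stand-in off-line rule (an assumption for illustration):
    w_i is the sign of the sum of the i-th components, with sgn 0 = 1."""
    n = len(patterns[0])
    return [1 if sum(u[i] for u in patterns) >= 0 else -1 for i in range(n)]

def estimate_success_probability(algorithm, n, m, trials, rng):
    """Monte Carlo estimate of P_A(n, m) under symmetric Bernoulli patterns."""
    successes = 0
    for _ in range(trials):
        patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
        w = algorithm(patterns)
        # Success means <w, u> > 0 for every pattern in the random m-set.
        if all(sum(wi * ui for wi, ui in zip(w, u)) > 0 for u in patterns):
            successes += 1
    return successes / trials

rng = random.Random(2)
p_small = estimate_success_probability(componentwise_majority, n=51, m=1, trials=200, rng=rng)
p_large = estimate_success_probability(componentwise_majority, n=51, m=400, trials=20, rng=rng)
```

Sweeping m for fixed n and plotting the estimate gives an empirical picture of the threshold behaviour that the capacity definition formalises; with a single pattern the stand-in rule always succeeds, since it simply returns that pattern.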


Lower and upper capacities in a natural sense provide
lower and upper bounds on the capability of an algorithm.
Note that while lower and upper capacities are
guaranteed to exist, the capacity function itself may
not (though it frequently does). Capacity requires a
sharp threshold characteristic in the computational
capability of the structure and the algorithm. Capacity
functions are not unique, even when they exist.
The following result (an easy consequence of the definition)
shows, however, that if capacity functions exist,
then they are not very different from each other
asymptotically.

Proposition 2.2 If $C_n$ is a capacity function then so
is $C_n(1 \pm o(1))$. Conversely, if $C_n$ and $C_n'$ are any two
capacity functions, then $C_n \sim C_n'$ as $n \to \infty$.

The capacity definition can be seen to be a sort of
distribution dependent analogue of the VC dimension.
For learning in a distribution free setting, Blumer et
al. [BEHW89] invoke the sufficient conditions for uniform
convergence of relative frequencies of events to
their probabilities derived in the seminal paper of Vapnik
and Červonenkis [VC71] to show that the sample
complexity for learning is proportional to the VC dimension
of the hypothesis class under consideration. A
similar argument utilising the necessary and sufficient
conditions derived in the Vapnik-Červonenkis paper
can be adduced to show that the sample complexity
for learning in a distribution dependent setting should
be proportional to the (probabilistic) capacity.^6

A natural question then is how this (distribution dependent)
notion of capacity is linked to the VC dimension.
We are dealing here with a sequence of hypothesis
classes $H_n$ (the family of half-spaces corresponding
to binary weight vectors from $\mathbb{B}^n$) with corresponding
VC dimensions $d_n \le n$. Using the fact that the number
of dichotomies of an m-set of patterns induced by
the hypothesis class $H_n$ is majorised by $m^{d_n} + 1$, it is
easy to show the following general result.

Theorem 2.3 Any lower capacity function $\underline{C}_n$ satisfies
$\underline{C}_n$ = O(d_n log* d_n) as n → ∞.

The  above  result  holds  for  all  choices  of  algorithm 
and  distribution.  (In  fact,  the  VC  dimension  can  be 
thought  of  as  a  special  case  of  the  capacity  when  the 
algorithm  allows  an  exhaustive  search  of  the  hypoth¬ 
esis  class,  and  all  distributions  are  allowed.)  The  ca¬ 
pacity  can,  hence,  never  exceed  the  VC  dimension  by 
very  much,  whatever  be  the  choice  of  algorithm  and 
distribution  family.  It  is  possible,  however,  to  have 
capacities  substantially  smaller  than  the  VC  dimen¬ 
sion  [Ven91(c)].  Distribution  families  for  which  this  is 
true  will  then  demand  much  smaller  sample  complex¬ 
ities  than  the  distribution  free  case. 

^6 Some slight additions have to be made to the definition,
but these are not critical in a network setting.


The capacity definition above can be readily extended
to more general situations where we have an arbitrary
sequence of computational structures (hypothesis
classes $H_n$), distribution families, algorithms, and
computational attributes more complex than correct
classification (such as associative memory with error
tolerance) [Ven91(c), Ven91(d)]. (For examples of capacity
calculations within this framework in a neural
network setting see [KP88(a), KP88(b), MPRV87,
N88, Ven91(e), VB91(a), VB91(b), VP91], for instance.)
The basic properties derived above (Proposition
2.2 and Theorem 2.3) continue to hold in the
general case.

3  HARMONIC  UPDATE 

The first algorithm we introduce, dubbed Harmonic
Update, is a single-pass, homogeneous, on-line
randomised algorithm for learning binary weights
[Ven91(a)]. As before, let $\mathcal{U}_n^m = \{u^1, \ldots, u^m\}$ be a
random m-set^7 of patterns in $\mathbb{B}^n$ drawn independently,
and with components drawn from a sequence of symmetric
Bernoulli trials. The training sequence consists
of the patterns, $u^1, \ldots, u^m$, presented in turn. Let
the initial choices of the weights, $w_i[1] \in \mathbb{B}$, be arbitrary,
and let $w = w[m+1]$ be the final weight vector
returned by the algorithm. Harmonic Update is a randomised
algorithm which prescribes weight updates as
follows.

For i = 1, ..., n, and epochs t = 1, ..., m:

• If $w_i[t] = u_i^t$, then set $w_i[t+1] = w_i[t]$.

• If $w_i[t] = -u_i^t$, then set $w_i[t+1] = -w_i[t]$
with probability $1/t$, and $w_i[t+1] = w_i[t]$
with probability $1 - 1/t$.
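
The update rule just listed can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `harmonic_update`, the all-ones initial weight vector, the seed, and the sizes n = 20000 and m = 5 are assumptions made for the example. Because the n components evolve independently, averaging $w_i u_i^\alpha$ over i gives an empirical estimate of the expectation $E\,w_i u_i^\alpha$, which the text gives as 1/m.

```python
import random

def harmonic_update(patterns, rng, w_init=None):
    """Single pass of Harmonic Update over patterns u^1, ..., u^m (lists of +/-1)."""
    n = len(patterns[0])
    # The initial weights are arbitrary; all ones is used here for concreteness.
    w = list(w_init) if w_init is not None else [1] * n
    for t, u in enumerate(patterns, start=1):
        for i in range(n):
            # Only a weight that disagrees with u_i^t may change, with probability 1/t.
            if w[i] != u[i] and rng.random() < 1.0 / t:
                w[i] = -w[i]
    return w

rng = random.Random(3)
n, m = 20000, 5
patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
w = harmonic_update(patterns, rng)
# Empirical average of w_i * u_i^alpha over the components, one value per pattern.
correlations = [sum(wi * ui for wi, ui in zip(w, u)) / n for u in patterns]
```

At epoch t = 1 a disagreeing weight flips with probability one, so the initial choice is forgotten immediately; the decreasing flip probabilities are what give every pattern, early or late, the same average correlation with the final weights.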

Clearly, Harmonic Update^8 is a randomised on-line algorithm,
and, as claimed, it is homogeneous and single-pass.
The algorithm is not mistake driven, and as each
example pattern is seen exactly once, the algorithm
terminates after the minimal number of steps, m. The
effect of this randomised procedure is to ensure that
each weight retains an equal amount of information
about the corresponding component of every pattern.
In particular,

$$E\,w_i u_i^\alpha = \frac{1}{m}, \qquad \alpha = 1, \ldots, m, \quad i = 1, \ldots, n, \qquad (2)$$


^7 For capacity calculations we seek m, the number of
patterns, as an explicit function of n. To keep the notation
simple, however, we write m instead of $m_n$, which would
make the dependence explicit, with the tacit understanding
that the number of patterns is actually a function of n.

^8 The name arises from the choice of the sequence of
probabilities $\{1/t, t \ge 1\}$ in the algorithm: at epoch t, 1/t
is the probability that a weight update results in a change
in sign of the weight when the current weight and the
corresponding component of the pattern from the training
sequence disagree in sign.


as can be readily seen by induction. The following
estimate now holds:

Theorem 3.1 The sequence $\sqrt{n}/\sqrt{\log n}$ is a capacity
function for the Harmonic Update Algorithm. Moreover,
no other homogeneous, single-pass algorithm has
a capacity function with a more rapid rate of growth.

Remarks: An application of the second moment
method shows that $\sqrt{n}/\sqrt{\log n}$ is a lower capacity, and
for brevity, we will restrict ourselves to proving this
here. To prove that $\sqrt{n}/\sqrt{\log n}$ is also an upper capacity
for Harmonic Update calls for some delicate footwork
with weakly dependent random variables. The
main ideas involved are the observation that the random
variables $Y_j^\alpha$ defined below are exchangeable, together
with a large deviation "Poissonisation" argument
which shows that the errors are Poisson distributed
asymptotically. The proof that this capacity
is maximal for homogeneous, single-pass algorithms
utilises some maximin inequalities proved in [VF91].
Details and the complete proof are given in [Ven91(a)].

Proof: [Sketch.] Define the random variables

$$Y_j^\alpha = w_j u_j^\alpha, \qquad j = 1, \ldots, n, \quad \alpha = 1, \ldots, m.$$

Note that by the locality of the Harmonic Update Algorithm, for each $j = 1, \ldots, n$, the weight $w_j$ depends solely on the $j$th components $u_j^1, \ldots, u_j^m$ of the patterns. By independence of the pattern components, and by symmetry, it follows that the weights $w_1, \ldots, w_n$ are i.i.d., symmetric Bernoulli random variables taking values $-1$ and $1$ only, each with probability $1/2$. It hence follows that for every fixed $\alpha$, the random variables $Y_1^\alpha, \ldots, Y_n^\alpha$ are independent, $\pm 1$ random variables, and as Harmonic Update is homogeneous, they are identically distributed as well. An inductive argument similar to (and only slightly more detailed than) the one used to establish (2) yields

$$\mathbf{P}\{Y_j^\alpha = -1\} = \frac{1}{2} - \frac{1}{2m} = q, \qquad \mathbf{P}\{Y_j^\alpha = +1\} = \frac{1}{2} + \frac{1}{2m} = p.$$

Now form the random sums

$$X^\alpha = \sum_{j=1}^{n} Y_j^\alpha, \qquad \alpha = 1, \ldots, m.$$

If $\mathbf{w}$ is to be a solution vector, we require that each $X^\alpha$ be positive. Let us estimate instead the probability that a particular pattern, say $\mathbf{u}^\alpha$, is not correctly classified by $\mathbf{w}$. The following exponential inequality due to Hoeffding now proves useful.

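Lemma 3.2 [Hoeffding] Let $Z_1, \ldots, Z_n$ be independent random variables with zero means and bounded ranges: $a_j \le Z_j \le b_j$. Then for every $\eta > 0$,

$$\mathbf{P}\Bigl\{\sum_{j=1}^{n} Z_j \le -\eta\Bigr\} \le \exp\Bigl(-\frac{2\eta^2}{\sum_{j=1}^{n} (b_j - a_j)^2}\Bigr).$$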


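As $X^\alpha$ is the sum of $n$ i.i.d. $\pm 1$ random variables, each with mean $p - q = 1/m$, Hoeffding's inequality directly yields⁸

$$\mathbf{P}\{X^\alpha \le 0\} \le \exp\Bigl(-\frac{n}{2m^2}\Bigr).$$

The probability $P_{HU}(n, m)$ that Harmonic Update correctly classifies each of the patterns in $U$ is bounded below by a simple application of Boole's inequality:

$$P_{HU}(n, m) \ge 1 - m\,\mathbf{P}\{X^\alpha \le 0\} \ge 1 - m \exp\Bigl(-\frac{n}{2m^2}\Bigr).$$

Now, for any $\epsilon > 0$, the choice

$$m = \frac{\sqrt{n}}{\sqrt{\log n}} \Bigl[ 1 + \frac{\log\log n + 2\log\epsilon}{2\log n} - O\Bigl(\frac{\log\log n}{(\log n)^2}\Bigr) \Bigr]$$

yields $P_{HU}(n, m) \ge 1 - \epsilon$ as $n \to \infty$. It follows that $\sqrt{n}/\sqrt{\log n}$ is a lower capacity function for Harmonic Update. ∎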
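The update described above can be sketched in a few lines of Python. This is a minimal illustration, assuming (from the footnote on the algorithm's name) that at epoch $t$ each weight component that disagrees in sign with the current pattern component is flipped independently with probability $1/t$. The names `harmonic_update` and `union_lower_bound` are hypothetical, and the bound evaluated is the Boole plus Hoeffding estimate $1 - m\exp(-n/2m^2)$ from the proof sketch.

```python
import math
import random


def harmonic_update(patterns, rng):
    """One pass of Harmonic Update over a list of +/-1 patterns.

    Assumption: at epoch t, a weight component that disagrees with the
    pattern component is flipped to agree with probability 1/t.
    """
    n = len(patterns[0])
    weights = [1] * n  # arbitrary start; epoch 1 overwrites every component
    for t, u in enumerate(patterns, start=1):
        for j in range(n):
            if weights[j] != u[j] and rng.random() < 1.0 / t:
                weights[j] = u[j]
    return weights


def union_lower_bound(n, m):
    """Evaluate 1 - m * exp(-n / (2 m^2)), the lower bound on success."""
    return 1.0 - m * math.exp(-n / (2.0 * m * m))


if __name__ == "__main__":
    rng = random.Random(0)
    pats = [[rng.choice((-1, 1)) for _ in range(1000)] for _ in range(10)]
    w = harmonic_update(pats, rng)
    print(union_lower_bound(1000, 10))
```

Under this reading each weight ends up equal to the corresponding component of a uniformly chosen pattern, which is one way to see why the correlation in (2) equals $1/m$.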


4  MAJORITY  RULE 


If off-line procedures are permitted, substantial gains in capacity can be made. The Majority Rule described below is a homogeneous, off-line algorithm with near maximal capacity.

As before, let $U = \{\mathbf{u}^1, \ldots, \mathbf{u}^m\}$ be a random $m$-set of patterns in $\mathbb{B}^n$, with components chosen from a sequence of symmetric Bernoulli trials. The Majority Rule prescribes weights as follows: for $i = 1, \ldots, n$, let

$$U_i^+ = \{\mathbf{u}^\alpha \in U : u_i^\alpha = +1\}, \qquad U_i^- = \{\mathbf{u}^\alpha \in U : u_i^\alpha = -1\},$$

and set

$$w_i = \begin{cases} +1 & \text{if } |U_i^+| > |U_i^-|, \\ -1 & \text{otherwise.} \end{cases}$$

In other words, $w_i = +1$ if patterns whose $i$th component is $+1$ are in the majority, and $w_i = -1$ otherwise. Clearly, Majority Rule is an off-line algorithm which is local and homogeneous. The following estimate can be obtained:

⁸A slightly more involved argument invoking the large deviation version of the classical De Moivre-Laplace central limit theorem (cf. Feller's text [Fel68], for instance) yields that if $m$ grows with $n$ such that $m = o(\sqrt{n})$ and $m/n^{1/3} \to \infty$, then $\mathbf{P}\{X^\alpha \le 0\} \sim \frac{m}{\sqrt{2\pi n}} \exp\bigl(-n/2m^2\bigr)$ as $n \to \infty$. This more precise estimate is needed to show that $\sqrt{n}/\sqrt{\log n}$ is also an upper capacity for Harmonic Update.


Theorem 4.1 The sequence $n/(\pi \log n)$ is a capacity function for the Majority Rule Algorithm. Moreover, no other homogeneous, off-line algorithm has a capacity function with a more rapid rate of growth.


Remarks: Again, we will content ourselves here with providing a sketch of the proof that $n/(\pi \log n)$ is a lower capacity for Majority Rule. Details may be found in [Ven91(a)].

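Proof: [Sketch.] We begin by noting that with probability one we can write

$$w_j = \operatorname{sgn}\Bigl(\sum_{\alpha=1}^{m} u_j^\alpha\Bigr), \qquad j = 1, \ldots, n.$$

As before, for $\alpha = 1, \ldots, m$, form the sums

$$X^\alpha = \sum_{j=1}^{n} w_j u_j^\alpha = \sum_{j=1}^{n} Y_j^\alpha,$$

where the $\pm 1$ random variables $Y_j^\alpha$ are defined by

$$Y_j^\alpha = \operatorname{sgn}\Bigl(1 + \sum_{\beta \ne \alpha} u_j^\alpha u_j^\beta\Bigr). \tag{3}$$

Note that, as before, for fixed $\alpha$, $Y_1^\alpha, \ldots, Y_n^\alpha$ are i.i.d., $\pm 1$ random variables. Now, the summands in the sum in (3) are i.i.d., symmetric Bernoulli, $\pm 1$ random variables, so that the sum is just a symmetric random walk over $m - 1$ steps. Let $m$ grow without bound as $n \to \infty$. An application of Stirling's formula then yields

$$\mathbf{P}\{Y_j^\alpha = -1\} \sim \frac{1}{2} - \frac{1}{\sqrt{2\pi m}}, \qquad \mathbf{P}\{Y_j^\alpha = +1\} \sim \frac{1}{2} + \frac{1}{\sqrt{2\pi m}}.$$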
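The Stirling estimate can be checked numerically. The sketch below assumes $m$ is odd (so that ties cannot occur) and uses the fact that the weight agrees with a given pattern component exactly when a symmetric random walk over $m - 1$ steps is non-negative. The function names are illustrative only.

```python
import math


def prob_agree_exact(m):
    """Exact P{Y = +1} for odd m: P{S_{m-1} >= 0} for a symmetric walk."""
    if m % 2 == 0:
        raise ValueError("this sketch assumes odd m, so ties cannot occur")
    steps = m - 1
    # P{S_{m-1} = 0} is the central binomial probability.
    p_zero = math.comb(steps, steps // 2) / 2 ** steps
    # By symmetry, P{S >= 0} = (1 + P{S = 0}) / 2.
    return 0.5 + 0.5 * p_zero


def prob_agree_stirling(m):
    """Asymptotic value 1/2 + 1/sqrt(2 pi m)."""
    return 0.5 + 1.0 / math.sqrt(2.0 * math.pi * m)


if __name__ == "__main__":
    for m in (11, 101, 1001):
        print(m, prob_agree_exact(m), prob_agree_stirling(m))
```

The two values approach each other as $m$ grows, which is all the proof sketch requires.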


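An application of Hoeffding's inequality as in Theorem 3.1 hence yields

$$\mathbf{P}\{X^\alpha \le 0\} \le \exp\Bigl(-\frac{n}{\pi m}\Bigr).$$

Using Boole's inequality, as before, yields that the probability $P_{MR}(n, m)$ that Majority Rule correctly classifies each of the patterns in $U$ is bounded below by

$$P_{MR}(n, m) \ge 1 - m\,\mathbf{P}\{X^\alpha \le 0\} \ge 1 - m \exp\Bigl(-\frac{n}{\pi m}\Bigr).$$

Now, for any $\epsilon > 0$, the choice

$$m = \frac{n}{\pi \log n} \Bigl[ 1 + \frac{\log\log n + \log \pi\epsilon}{\log n} - O\Bigl(\frac{\log\log n}{(\log n)^2}\Bigr) \Bigr]$$

yields $P_{MR}(n, m) \ge 1 - \epsilon$ as $n \to \infty$. It follows that $n/(\pi \log n)$ is a lower capacity function for Majority Rule. ∎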
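A minimal Python sketch of the Majority Rule follows, assuming patterns are lists of $\pm 1$ values and that ties are resolved to $-1$ as in the verbal description. The helper names (`majority_rule`, `classifies_all`, `majority_lower_bound`, `random_patterns`) are hypothetical, and the bound evaluated is $1 - m\exp(-n/\pi m)$ from the proof sketch.

```python
import math
import random


def random_patterns(m, n, rng):
    """m patterns of n independent symmetric +/-1 components."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]


def majority_rule(patterns):
    """Set w_i = +1 if +1 is in the strict majority in component i, else -1."""
    n = len(patterns[0])
    return [1 if sum(u[i] for u in patterns) > 0 else -1 for i in range(n)]


def classifies_all(weights, patterns):
    """True if <w, u> > 0 for every pattern u."""
    return all(sum(w * x for w, x in zip(weights, u)) > 0 for u in patterns)


def majority_lower_bound(n, m):
    """Evaluate 1 - m * exp(-n / (pi m)), the lower bound on success."""
    return 1.0 - m * math.exp(-n / (math.pi * m))


if __name__ == "__main__":
    rng = random.Random(1)
    pats = random_patterns(21, 2001, rng)
    print(classifies_all(majority_rule(pats), pats))
    print(majority_lower_bound(2001, 21))
```

Because each weight depends only on one column of the pattern set, the rule is local, but it needs all patterns at once, which is what makes it off-line.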


5  DIRECTED  DRIFT 

We conclude with a family of randomised, on-line, local (but non-homogeneous)¹⁰ algorithms for binary learning [Ven91(b)]. We call these algorithms Directed Drift because, as we shall see, they share some similarities with asymmetric random walks with a preferred direction toward a solution.

Let $U$ be any subset of patterns from $\mathbb{B}^n$, and let $\{\mathbf{u}[t]\}$ be any training sequence such that each of the patterns in $U$ appears infinitely often.¹¹ Let $\{\mathbf{w}[t]\}$ denote a binary learning sequence. For each epoch, $t$, we denote by $J[t]$ the subset of indices for which the corresponding components of $\mathbf{w}[t]$ and $\mathbf{u}[t]$ are opposite in sign:

$$J[t] = \{\, j : w_j[t] \ne u_j[t] \,\}.$$


5.1  SINGLE  BIT  UPDATES 


In the simplest version of Directed Drift, no more than a single component of the weight vector is updated per epoch.

Base: $\mathbf{w}[1] \in \mathbb{B}^n$ is chosen arbitrarily.
Iteration: Weight updates are predicated upon whether a correct or incorrect response is obtained at the current epoch, $t$.

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle > 0$, then the weight vector is left unchanged: $\mathbf{w}[t+1] = \mathbf{w}[t]$.¹²

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle \le 0$, then an index $j[t]$ is picked at random from the set of indices, $J[t]$, of mismatched components. The new weight vector is now formed according to the following rule:

$$w_j[t+1] = \begin{cases} w_j[t] & \text{if } j \ne j[t], \\ -w_j[t] & \text{if } j = j[t]. \end{cases} \tag{4}$$


The intuition behind the algorithm is as follows. If a binary solution vector, $\mathbf{w}^* \in \mathbb{B}^n$, exists, then necessarily we must have $\langle \mathbf{w}^*, \mathbf{u} \rangle = \sum_{j=1}^{n} w_j^* u_j > 0$ for each pattern $\mathbf{u} \in U$. As there is a contribution of $+1$ to the sum if two corresponding components of $\mathbf{w}^*$ and $\mathbf{u}$ have the same sign, and $-1$ if the signs are mismatched, it follows that the binary solution vector has more component sign matches than mismatches with each pattern in $U$.

¹⁰The algorithms actually have a slightly non-local preamble at each epoch. We will ignore this non-locality and continue to call the algorithms local.

¹¹Note that $U \subset \mathbb{B}^n$ is a finite set of patterns. If $U = \{\mathbf{u}^1, \ldots, \mathbf{u}^m\}$ is an $m$-set of patterns, then we can, for instance, obtain valid training sequences by cycling through the patterns or by choosing a pattern randomly at each epoch.

¹²If it ain't broke, don't fix it.

Now  the  algorithm  updates  the  current  estimate  of  the 
weight  vector  if  and  only  if  the  current  pattern  from 
the  training  sequence  is  misclassified.  A  weight  vector 
update  results  in  a  randomly  chosen  mismatched  com¬ 
ponent  of  the  weight  vector  being  flipped  to  the  sign  of 
the  corresponding  pattern  component.  Since  there  is 
a  probability  better  than  a  half  that  a  randomly  speci¬ 
fied  component  of  any  pattern  has  the  same  sign  as  the 
corresponding  component  of  a  binary  solution  vector, 
it  follows  that  at  least  during  the  initial  progress  of 
the  algorithm,  the  a  priori  probability  that  the  weight 
vector  update  is  in  the  direction  of  the  binary  solution 
vector  is  better  than  a  half.  We  will  explore  this  more 
formally  in  the  sequel. 
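The single bit update procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the training sequence cycles through the patterns, a pattern with non-positive inner product counts as misclassified, and the names `directed_drift` and `max_epochs` are hypothetical.

```python
import random


def directed_drift(patterns, rng, max_epochs=100_000):
    """Single bit update Directed Drift, cycling through the patterns.

    Returns (weights, mistakes, converged). Patterns are lists of +/-1.
    """
    n = len(patterns[0])
    w = [rng.choice((-1, 1)) for _ in range(n)]  # arbitrary base vector
    mistakes = 0
    clean_streak = 0  # consecutive correctly classified patterns
    for t in range(max_epochs):
        u = patterns[t % len(patterns)]
        if sum(wj * uj for wj, uj in zip(w, u)) > 0:
            clean_streak += 1
            if clean_streak == len(patterns):
                return w, mistakes, True  # a full clean cycle: w is a solution
            continue
        clean_streak = 0
        mismatched = [j for j in range(n) if w[j] != u[j]]  # the set J[t]
        j = rng.choice(mismatched)
        w[j] = -w[j]  # the flipped component now agrees with u[j]
        mistakes += 1
    return w, mistakes, False


if __name__ == "__main__":
    rng = random.Random(7)
    pats = [[rng.choice((-1, 1)) for _ in range(21)] for _ in range(3)]
    print(directed_drift(pats, rng)[1:])
```

A misclassified pattern always has at least one mismatched component, so the random choice is well defined. The `converged` flag is returned because termination is only probabilistic, and only when a binary solution exists.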

5.2  ANALYSIS 

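Let $\mathbf{w}^* \in \mathbb{B}^n$ denote a solution vector, and let $\{t_k\}$ denote the subsequence of epochs at which patterns from the training sequence are misclassified; i.e.,

$$\langle \mathbf{w}[t_k], \mathbf{u}[t_k] \rangle \le 0, \qquad k = 1, 2, \ldots.$$

Let

$$\mathcal{E}_{k+1} = \|\mathbf{w}[t_k + 1] - \mathbf{w}^*\|^2$$

denote the estimate error at epoch $k$ for single bit update Directed Drift. A straightforward inductive argument then gives

$$\mathcal{E}_{k+1} = \mathcal{E}_1 - 4 \sum_{i=1}^{k} X_i,$$

where we define the $\pm 1$ random variables $X_i$ by

$$X_i = w^*_{j[t_i]}\, u_{j[t_i]}[t_i],$$

so that $X_i = +1$ exactly when the $i$th update flips a weight component to the sign of the corresponding component of $\mathbf{w}^*$. Upper bounding $\mathcal{E}_1$ by $4n$ and setting $S_k = \sum_{i=1}^{k} X_i$, we then obtain

$$0 \le \mathcal{E}_{k+1} \le 4(n - S_k).$$

The procedure terminates at the value of $k$ for which the random sum $S_k$ first reaches $n$. The mistake bound $T$ hence satisfies $S_T \ge n$, and $S_k < n$ for $k = 1, \ldots, T - 1$. The mistake bound is infinite if there exists no such value of $k$, or if there exists no binary solution vector for the choice of patterns $U$.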

The  above  is  reminiscent  of  a  random  walk  with  a  fixed 
boundary  at  n.  Let  the  m-set  of  patterns  U  be  cho¬ 
sen  independently,  with  components  drawn  from  a  se¬ 
quence  of  symmetric  Bernoulli  trials,  and  assume  m  is 
within  the  capacity  of  the  algorithm.  (With  high  prob¬ 
ability,  then,  there  exists  a  solution  vector.)  Through 


the initial progress of the algorithm then, the random variables $X_i$ will be independent with

$$X_i = \begin{cases} -1 & \text{with probability } q_{n,m} = 1/2 - \beta_{n,m}, \\ +1 & \text{with probability } p_{n,m} = 1/2 + \beta_{n,m}, \end{cases}$$

for some $\beta_{n,m} \in (0, 1/2]$. The specific value of the probability $p_{n,m}$ (the relative frequency of the number of component matches between a binary solution vector and a pattern) depends explicitly both on $n$ and the number of patterns $m$. Clearly,

$$p_{n,m} \ge \frac{n/2 + 1}{n} = \frac{1}{2} + \frac{1}{n},$$

so that $\beta_{n,m} \ge 1/n$. A heuristic estimate of the expected mistake bound for single bit update Directed Drift when $m$ is within capacity can hence be obtained from the expected time of first passage to $n$ of a one-dimensional random walk $S_k$, with positive drift $2k\beta_{n,m}$. The following result is an application of Wald's equation:

Lemma 5.1 Let $\{X_j\}$ be i.i.d., $\pm 1$ random variables with mean $\mathbf{E} X_j = 2\beta_{n,m} \ge 2/n$, and let $S_k = \sum_{j=1}^{k} X_j$ denote a random walk with positive drift $\mathbf{E} S_k = 2k\beta_{n,m} \ge 2k/n$. Let

$$T_1 = \inf\{k : S_k = n\}$$

denote the time of first passage to the fixed boundary at $n$. Then

$$\mathbf{E}\, T_1 = \frac{n}{2\beta_{n,m}} \le \frac{n^2}{2}$$

for every $n$.

Proof: Let $\mathcal{F}_k = \sigma(X_1, \ldots, X_k)$, so that $\{\mathcal{F}_k, k \ge 1\}$ is an increasing sequence of sub-$\sigma$-algebras. $T_1$ is clearly a stopping time with respect to $\{\mathcal{F}_k\}$. Now, as $j \to \infty$, we have

$$\mathbf{P}\{T_1 \ge j\} = \mathbf{P}\{S_1 < n, \ldots, S_{j-1} < n\} \le \mathbf{P}\{S_{j-1} < n\} = O(e^{-cj})$$

for a positive constant $c$, as, for fixed $n$, the penultimate expression is the probability in the extreme left tail of the binomial distribution. Hence,

$$\mathbf{E}\, T_1 = \sum_{j=1}^{\infty} \mathbf{P}\{T_1 \ge j\} < \infty$$

as the terms of the series decrease exponentially fast. For any measurable set $A$, let $1_A$ be the indicator random variable for $A$. We now have

$$\mathbf{E}|S_{T_1}| \le \mathbf{E} \sum_{j=1}^{\infty} |X_j|\, 1_{\{T_1 \ge j\}} = \sum_{j=1}^{\infty} \mathbf{P}\{T_1 \ge j\} < \infty.$$

Now, the random variables $X_j$ and $1_{\{T_1 \ge j\}}$ are independent for every $j \ge 1$, so that by the dominated convergence theorem and the definition of the stopping time $T_1$ we have

$$n = \mathbf{E}\, S_{T_1} = \mathbf{E} \sum_{j=1}^{\infty} X_j\, 1_{\{T_1 \ge j\}} = \sum_{j=1}^{\infty} \mathbf{E}(X_j)\, \mathbf{P}\{T_1 \ge j\} = \mathbf{E}(X_1)\, \mathbf{E}(T_1).$$

The result follows. ∎

Using $\mathbf{E}\, T_1$ as a rough estimate for the expected mistake bound for single bit update Directed Drift results in the estimate $O(n^2)$. Simulations confirm this rapid convergence of Directed Drift when the number of examples is within the capacity of the algorithm.
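The first passage estimate can be illustrated with a small Monte Carlo sketch. It simulates the idealised random walk of the lemma, not Directed Drift itself, and assumes steps of $+1$ with probability $1/2 + \beta$ and $-1$ otherwise; the function names are hypothetical. By Wald's equation the mean first passage time to $n$ should be close to $n/(2\beta)$.

```python
import random


def first_passage_time(n, beta, rng):
    """Steps until a +/-1 walk with P{+1} = 1/2 + beta first hits n."""
    position, steps = 0, 0
    while position < n:
        position += 1 if rng.random() < 0.5 + beta else -1
        steps += 1
    return steps


def mean_first_passage(n, beta, trials, rng):
    """Monte Carlo estimate of the expected first passage time to n."""
    total = sum(first_passage_time(n, beta, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(3)
    n, beta = 10, 0.25
    print(mean_first_passage(n, beta, 4000, rng), n / (2 * beta))
```

With the worst-case drift $\beta = 1/n$ the same formula gives $n^2/2$, which is the source of the quadratic estimate.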

There are a number of open issues about the algorithm which we are currently in the process of resolving. A rigorous and general analysis of stopping times for the algorithm involves careful consideration of the matrix of transition probabilities of a finite Markov chain, the transition probabilities depending both on $n$ and $m$. Capacity estimates for the algorithm are currently between the orders of $n/\log n$ and $n$. The upper capacity estimate of $n$ is an immediate consequence of Boole's inequality: the probability that there exists a binary solution vector for $m$ randomly drawn patterns is less than $2^{n-m}$, and if $m$ exceeds $n$, this probability plunges below $1/2$. The lower capacity estimate of
the  order  of  n/log  n  follows  from  the  estimate  of  the 
capacity  of  the  Majority  Rule  algorithm:  by  construc¬ 
tion  of  the  Majority  Rule,  if  m  is  less  than  the  order 
of  n/log  n  then  there  exists  a  binary  solution  vector 
with  high  probability.  An  analysis  of  the  transition 
probability  matrix  for  Directed  Drift  indicates  that 
the  probability  that  the  system  stays  forever  among 
the  (finite)  set  of  transient  states  is  zero  when  there 
exists  at  least  one  solution  vector  (which  constitutes 
an  absorbing  state),  so  that  the  lower  capacity  is  at 
least  of  the  order  of  n/logn.  We  conjecture  that,  in 
fact,  the  capacity  of  the  algorithm  is  n. 
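The Boole's inequality step behind the upper capacity estimate can be written out as follows. This is a sketch assuming independent patterns with symmetric components, so that each fixed $\mathbf{w}$ classifies a given random pattern correctly with probability at most $1/2$:

```latex
\begin{align*}
\mathbf{P}\{\text{some } \mathbf{w} \in \mathbb{B}^n \text{ classifies all } m \text{ patterns}\}
  &\le \sum_{\mathbf{w} \in \mathbb{B}^n}
       \mathbf{P}\{\langle \mathbf{w}, \mathbf{u}^\alpha \rangle > 0,\ \alpha = 1, \ldots, m\} \\
  &= \sum_{\mathbf{w} \in \mathbb{B}^n} \prod_{\alpha=1}^{m}
       \mathbf{P}\{\langle \mathbf{w}, \mathbf{u}^\alpha \rangle > 0\} \\
  &\le 2^n \cdot 2^{-m} = 2^{n-m}.
\end{align*}
```

For $m \ge n + 1$ the right-hand side is at most $1/2$.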

Simulations indicate that there are two distinct regimes of behaviour: a regime below capacity where convergence is very rapid (in quadratic time) in consonance with the rough analysis above, and a regime above capacity where the analytical picture is much less clear and where convergence takes substantially longer. An abrupt transition around the capacity of the algorithm is seen between the two regimes of convergence time. R. Meir has recently communicated to us that in Monte Carlo simulations and comparisons with genetic algorithms, Directed Drift appears to have an optimal character [Mei91]. A slightly more detailed analysis and specifics of simulation results are included in [Ven91(b)].


5.3  SEVERAL  BIT  UPDATES 

The algorithm can be simply extended to accommodate more than a single bit update per epoch. Let $\{N_t\}$ be a sequence of integers with $0 < N_t < n/2$.

Base: $\mathbf{w}[1] \in \mathbb{B}^n$ is chosen arbitrarily.
Iteration: As before, updates are made only if the current pattern from the training sequence is misclassified.

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle > 0$, then the weight vector is left unchanged: $\mathbf{w}[t+1] = \mathbf{w}[t]$.

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle \le 0$, then $N_t$ indices $j_1[t], \ldots, j_{N_t}[t]$ are picked at random from the set of indices, $J[t]$, of mismatched components. The new weight vector is now formed according to the following rule: if $j \notin \{j_1[t], \ldots, j_{N_t}[t]\}$ then set $w_j[t+1] = w_j[t]$; else if $j \in \{j_1[t], \ldots, j_{N_t}[t]\}$ then set $w_j[t+1] = -w_j[t]$.

The  sequence  Nt  specifies  the  number  of  bits  to  be 
changed  at  each  update  epoch,  and  the  proper  choice 
of  this  sequence  is  clearly  critical  to  the  function¬ 
ing  of  the  algorithm.  This  is  analogous  to  choosing 
an  appropriate  cooling  schedule  for  simulated  anneal¬ 
ing  [KGV83].  Anecdotal  evidence  from  simulations  in¬ 
dicates  that  significant  improvements  in  convergence 
can  be  obtained  over  single  bit  updates  by  appropri¬ 
ate  choices  of  the  sequence  Nt . 
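One epoch of the several bit update can be sketched as below. The sketch assumes the flipped indices are drawn without replacement from the mismatched set, and it defensively caps the number of flips at the size of that set; the name `several_bit_update` and the parameter `n_flips` (standing in for $N_t$) are hypothetical.

```python
import random


def several_bit_update(w, u, n_flips, rng):
    """One epoch of several bit update Directed Drift; returns new weights."""
    if sum(wj * uj for wj, uj in zip(w, u)) > 0:
        return list(w)  # correctly classified: leave the weights unchanged
    mismatched = [j for j in range(len(w)) if w[j] != u[j]]  # the set J[t]
    # Draw distinct indices; the cap guards against n_flips > |J[t]|.
    chosen = rng.sample(mismatched, min(n_flips, len(mismatched)))
    new_w = list(w)
    for j in chosen:
        new_w[j] = -new_w[j]  # flipped components now agree with u
    return new_w


if __name__ == "__main__":
    rng = random.Random(2)
    print(several_bit_update([1, 1, 1, 1, 1, 1], [-1, -1, -1, -1, 1, 1], 2, rng))
```

Since a misclassified pattern has at least $n/2$ mismatched components, the cap is never active when $N_t < n/2$.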

6  CONCLUSIONS 

The investigations reported here constitute initial forays into two areas: (1) using randomisation as a tool in the development of efficient learning algorithms for networks with binary weights (or, more generally, dynamic range constrained weights); and (2) developing notions of probabilistic capacity which, in distribution dependent situations, yield results on sample complexities for learning analogous to the distribution free results that derive from the VC dimension. Notwithstanding the theoretical stumbling blocks in learning binary weights (intractable worst cases may exist as a consequence of the NP-completeness of the problem), there is a strong practical motivation to develop learning algorithms for this case because of the lower cost and simplicity of circuits comprised of binary interconnections. The success (albeit limited) of the randomised algorithms reported here suggests that these may repay further investigation; in particular, we might be able to hope for good average case behaviour in certain regimes. We are currently investigating certain extensions of these ideas in networks with more complex interconnectivity patterns than the single neuron considered here. The parallel development of notions of distribution dependent capacity presented here in brief is aimed at providing a better understanding of (distribution dependent) problems where practitioners report a wide gulf between the sample complexities needed in practice and those predicted in the distribution free model.

A  PERCEPTRON  TRAINING 

It is instructive to compare convergence rates for Directed Drift with those that obtain for Perceptron Training. Let $\{\mathbf{u}[t]\}$ be a training sequence of patterns, and let $\{\mathbf{w}[t]\}$ denote a learning sequence of real weight vectors. We will assume that there exists a binary solution vector $\mathbf{w}^* \in \mathbb{B}^n$. (Perceptron Training will, in general, not converge to a binary solution, however, even if one exists.)

A.1 FIXED INCREMENT PERCEPTRON TRAINING

This is the simplest form of Perceptron Training. Let $\theta > 0$ be fixed.

Base: The initial choice of weight vector is arbitrary. For simplicity we take $\mathbf{w}[1] = \mathbf{0}$.
Iteration: As before, weight vector updates are made only if a pattern is misclassified.

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle > 0$, then the weight vector is left unchanged: $\mathbf{w}[t+1] = \mathbf{w}[t]$.

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle \le 0$, then set $\mathbf{w}[t+1] = \mathbf{w}[t] + \theta\,\mathbf{u}[t]$.

Note that fixed increment Perceptron Training is homogeneous and on-line.
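The update loop just described can be sketched as follows. The sketch assumes a cyclic training sequence, treats a non-positive inner product as a misclassification (so the all-zero start vector triggers the first update), and writes the fixed increment as `theta`. The helper `separable_patterns`, which builds patterns admitting a binary solution for odd $n$, is purely illustrative.

```python
import random


def separable_patterns(n, m, rng):
    """Random +/-1 patterns (n odd), sign-flipped so a hidden binary w* works."""
    w_star = [rng.choice((-1, 1)) for _ in range(n)]
    patterns = []
    for _ in range(m):
        u = [rng.choice((-1, 1)) for _ in range(n)]
        if sum(a * b for a, b in zip(w_star, u)) < 0:
            u = [-x for x in u]
        patterns.append(u)
    return patterns


def perceptron(patterns, theta=1.0, max_epochs=1_000_000):
    """Fixed increment Perceptron Training; returns (weights, mistakes)."""
    n = len(patterns[0])
    w = [0.0] * n
    mistakes = 0
    clean_streak = 0  # consecutive correctly classified patterns
    for t in range(max_epochs):
        u = patterns[t % len(patterns)]
        if sum(wj * uj for wj, uj in zip(w, u)) > 0:
            clean_streak += 1
            if clean_streak == len(patterns):
                break  # one full clean cycle: every pattern is classified
            continue
        clean_streak = 0
        w = [wj + theta * uj for wj, uj in zip(w, u)]  # all n components move
        mistakes += 1
    return w, mistakes


if __name__ == "__main__":
    rng = random.Random(11)
    pats = separable_patterns(15, 10, rng)
    print(perceptron(pats)[1])
```

Every update changes all $n$ components, in contrast with the single bit updates of Directed Drift.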

We now claim that the procedure will converge to an, in general, non-binary solution with a worst-case mistake bound of $n^2$ if there exists a binary solution vector. Let $\mathbf{w}^* \in \mathbb{B}^n$ be a binary solution vector, and as before, let $\{t_k\}$ denote the subsequence of epochs at which patterns from the training sequence are misclassified; i.e.,

$$\langle \mathbf{w}[t_k], \mathbf{u}[t_k] \rangle \le 0, \qquad k = 1, 2, \ldots. \tag{5}$$

Set $\mathbf{w}[1] = \mathbf{0}$, and, for a value of parameter $c > 0$ to be specified, consider the estimate errors

$$\mathcal{E}_{k+1} = \|\mathbf{w}[t_{k+1}] - c\,\mathbf{w}^*\|^2.$$

Using (5), a standard inductive argument then yields the bounds

$$0 \le \mathcal{E}_{k+1} \le c^2 n - \theta(2c - \theta n)k.$$

We hence have the worst-case mistake bound

$$k \le \frac{c^2 n}{\theta(2c - \theta n)}.$$

Minimising the bound with respect to $c$ yields the upper bound $n^2$ for the mistake bound. (For better mistake bounds see [Lit88].) The fixed increment Perceptron Training Algorithm hence terminates after $O(n^3)$ component updates if a binary solution vector exists.
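The minimisation over $c$ can be made explicit. The sketch below writes $\theta$ for the fixed increment and assumes the worst-case bound has the form $c^2 n / \theta(2c - \theta n)$, as reconstructed from the inductive argument:

```latex
\begin{align*}
f(c) &= \frac{c^2 n}{\theta(2c - \theta n)}, \qquad c > \theta n / 2, \\
f'(c) &= \frac{2 c n\,(c - \theta n)}{\theta\,(2c - \theta n)^2} = 0
  \quad\Longrightarrow\quad c = \theta n, \\
f(\theta n) &= \frac{\theta^2 n^3}{\theta \cdot \theta n} = n^2.
\end{align*}
```

The minimum does not depend on $\theta$, and since each update modifies all $n$ components, $n^2$ mistakes correspond to at most $n^3$ component updates.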


A.2 SINGLE COMPONENT PERCEPTRON TRAINING

The basic randomisation idea behind single bit update Directed Drift is easily extended to single component Perceptron Training, where a single component of the weight vector is modified at each update epoch (as opposed to fixed increment Perceptron Training where all components are modified at each epoch).

For each epoch, $t$, let $I[t]$ denote the subset of indices for which the corresponding components of $\mathbf{w}[t]$ and $\mathbf{u}[t]$ are opposite in sign:

$$I[t] = \{\, i : u_i[t] \ne \operatorname{sgn} w_i[t] \,\}.$$

Base: For simplicity, take $\mathbf{w}[1] = \mathbf{0}$.
Iteration: Weight updates are predicated, as usual, upon whether a correct or incorrect response is obtained at the current epoch, $t$.

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle > 0$, then the weight vector is left unchanged: $\mathbf{w}[t+1] = \mathbf{w}[t]$.

• If $\langle \mathbf{w}[t], \mathbf{u}[t] \rangle \le 0$, then an index $i[t]$ is picked at random from the set of indices, $I[t]$, of mismatched components. The new weight vector is now formed according to the following rule:

$$w_i[t+1] = \begin{cases} w_i[t] & \text{if } i \ne i[t], \\ w_i[t] + u_i[t] & \text{if } i = i[t]. \end{cases} \tag{6}$$


Just  as  for  Directed  Drift,  single  component  Percep¬ 
tron  Training  is  local,  non-homogeneous,  randomised, 
and  on-line. 


As before, we can heuristically estimate the average-case performance of the algorithm by appealing to ideas from random walks. Let $\{t_k\}$ satisfying (5) denote the subsequence of epochs at which patterns from the training sequence are misclassified. Let $\mathcal{L}_{k+1} = \|\mathbf{w}[t_{k+1}]\|$ denote the length of the weight vector at epoch $t_{k+1}$. By definition of the algorithm we inductively obtain the upper bound

$$\mathcal{L}_{k+1}^2 \le \mathcal{L}_k^2 + 1 \le k,$$

while the Cauchy-Schwarz inequality yields the lower bound

$$\mathcal{L}_{k+1}^2 \ge \frac{|\langle \mathbf{w}[t_{k+1}], \mathbf{w}^* \rangle|^2}{\|\mathbf{w}^*\|^2} = \frac{1}{n} \Bigl( \sum_{h=1}^{k} w^*_{i[t_h]}\, u_{i[t_h]}[t_h] \Bigr)^2.$$

Define the $\pm 1$ random variables $X_h = w^*_{i[t_h]}\, u_{i[t_h]}[t_h]$ and set $S_k = \sum_{h=1}^{k} X_h$. We then have the bounds

$$\frac{|S_k|}{\sqrt{n}} \le \mathcal{L}_{k+1} \le \sqrt{k}.$$

The algorithm terminates at the first instant $k$ for which $|S_k|$ exceeds $\sqrt{kn}$. Within capacity again, the situation is reminiscent of a random walk $S_k$ with positive drift $\mathbf{E} S_k = 2k\beta_{n,m} \ge 2k/n$, and absorbing boundaries at $\pm\sqrt{kn}$. We refer the reader to [Ven91(b)] for the proof of the following result:

Lemma A.1 Let $\{X_j\}$ be i.i.d., $\pm 1$ random variables with mean $\mathbf{E} X_j = 2\beta_{n,m} \ge 2/n$, and let $S_k = \sum_{j=1}^{k} X_j$ denote a random walk with positive drift $\mathbf{E} S_k = 2k\beta_{n,m} \ge 2k/n$. Let

$$T_2 = \inf\{k : |S_k| \ge \sqrt{kn}\}$$

denote the time of first passage to the receding (two-sided) boundary at $\pm\sqrt{kn}$. Then

Using $\mathbf{E}\, T_2$ as a rough estimate for the expected mistake bound (when $m$ is within capacity), we get the asymptotic estimate $O(n^3)$ for the expected mistake bound of single component Perceptron Training as
$n \to \infty$.

What Sparsity Implies to Robustness and Memory

Abstract

Robustness  is  a  commonly  bruited  property  of  neural  networks;  in  particu¬ 
lar,  a  folk  theorem  in  neural  computation  asserts  that  neural  networks — in 
contexts  with  large  interconnectivity — continue  to  function  efficiently,  al¬ 
beit  with  some  degradation,  in  the  presence  of  component  damage  or  loss. 
A  second  folk  theorem  in  such  contexts  asserts  that  dense  interconnectiv¬ 
ity  between  neural  elements  is  a  sine  qua  non  for  the  efficient  usage  of 
resources.  These  premises  are  formally  examined  in  this  communication 
in a setting that invokes the notion of the “devil”¹ in the network as an
agent  that  produces  sparsity  by  snipping  connections. 


1  ON  REMOVING  THE  FOLK  FROM  THE  THEOREM 

Robustness  in  the  presence  of  component  damage  is  a  property  that  is  commonly 
attributed  to  neural  networks.  The  content  of  the  following  statement  embodies 
this  sentiment. 

Folk  Theorem  1:  Computation  in  neural  networks  is  not  substantially 
affected  by  damage  to  network  components. 

While such a statement is manifestly not true in general (witness networks with “grandmother cells” where damage to the critical cells fatally impairs the computational ability of the network), there is anecdotal evidence in support of it in situations where the network has a more “distributed” flavour with relatively dense interconnectivity of elements and a distributed format for the storage of information. Qualitatively, the phenomenon is akin to holographic modes of storing information where the distributed, non-localised format of information storage carries with it a measure of security against component damage.

¹Well, maybe an imp.

The flip side to the robust folk theorem is the following observation, robustness notwithstanding:

Folk  Theorem  2:  Dense  interconnectivity  is  a  sine  qua  non  for  efficient 
usage  of  resources;  in  particular,  sparser  structures  exhibit  a  degradation 
in  computational  capability. 

Again,  disclaimers  have  to  be  thrown  in  on  the  applicability  of  such  a  statement. 
In  recurrent  network  architectures,  however,  this  might  seem  to  have  some  merit. 
In  particular,  in  associative  memory  applications,  while  structural  robustness  might 
guarantee  that  the  loss  in  memory  storage  capacity  with  increased  interconnection 
sparsity  may  not  be  catastrophic,  nonetheless  intuitively  a  drop  in  capacity  with 
increased  sparsity  may  be  expected. 

This communication represents an effort to mathematically codify these tenets. In the setting we examine we formally introduce sparse network interconnectivity by invoking the notion of a (puckish) devil in the network which severs interconnection links between neurons. Our results here involve some surprising consequences, viewed in the light of the two folk theorems, of sparse interconnectivity to robustness and to memory storage capability. Only the main results are stated here; for extensions and details of proofs we refer the interested reader to Venkatesh (1990) and Biswas and Venkatesh (1990).

Notation We denote by $\mathbb{B}$ the set $\{-1, 1\}$. For every integer $k$ we denote the set of integers $\{1, 2, \ldots, k\}$ by $[k]$. By ordered multiset we mean an ordered collection of elements with repetition of elements allowed, and by $k$-set we mean an ordered multiset of $k$ elements. All logarithms in the exposition are to base $e$.


2  RECURRENT  NETWORKS 

2.1  INTERCONNECTION  GRAPHS 

We consider a recurrent network of $n$ formal neurons. The allowed pattern of neural interconnectivity is specified by the edges of a (bipartite) interconnectivity graph, $G_n$, on vertices, $[n] \times [n]$. In particular, the existence of an edge $\{i, j\}$ in $G_n$ is indicative that the state of neuron $j$ is input to neuron $i$.² The network is characterised by an $n \times n$ matrix of weights, $\mathbf{W} = [w_{ij}]$, where $w_{ij}$ denotes the (real) weight modulating the state of neuron $j$ at the input of neuron $i$. If $\mathbf{u} \in \mathbb{B}^n$ is the current state of the system, an update, $u_i \mapsto u_i'$, of the state of neuron $i$ is specified by the linear threshold rule

$$u_i' = \operatorname{sgn}\Bigl( \sum_{j : \{i,j\} \in G_n} w_{ij} u_j \Bigr).$$

²Equivalently, imagine a devil loose with a pair of scissors snipping those interconnections for which $\{i, j\} \notin G_n$. For a complementary discussion of sparse interconnectivity see Komlos and Paturi (1986).

The network dynamics describe trajectories in a state space comprised of the vertices of the $n$-cube.³ We are interested in an associative memory application where we wish to store a desired set of states, the memories, as fixed points of the network, and with the property that errors in an input representation of a memory are corrected and the memory retrieved.

2.2  DOMINATORS 

Let $\mathbf{u} \in \mathbb{B}^n$ be a memory and $0 \le \rho < 1$ a parameter. Corresponding to the memory $\mathbf{u}$ we generate a probe $\hat{\mathbf{u}} \in \mathbb{B}^n$ by independently specifying the components, $\hat{u}_j$, of the probe as follows:

$$\hat{u}_j = \begin{cases} u_j & \text{with probability } 1 - \rho, \\ -u_j & \text{with probability } \rho. \end{cases} \tag{1}$$

We call $\hat{\mathbf{u}}$ a random probe with parameter $\rho$.

Definition 2.1 We say that a memory, $\mathbf{u}$, dominates over a radius $\rho n$ if, with probability approaching one as $n \to \infty$, the network corrects all errors in a random probe with parameter $\rho$ in one synchronous step. We call $\rho$ the (fractional) dominance radius. We also say that $\mathbf{u}$ is stable if it is a 0-dominator.

Remarks: Note that stable memories are just fixed points of the network. Also, the expected number of errors in a probe is $\rho n$.
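Generating a random probe is straightforward to sketch in Python. The names `random_probe` and `hamming_errors` are hypothetical, and `rho` stands for the probe parameter; each component is flipped independently with that probability.

```python
import random


def random_probe(memory, rho, rng):
    """Flip each component of a +/-1 memory independently with probability rho."""
    return [-x if rng.random() < rho else x for x in memory]


def hamming_errors(memory, probe):
    """Number of components in which the probe differs from the memory."""
    return sum(1 for a, b in zip(memory, probe) if a != b)


if __name__ == "__main__":
    rng = random.Random(5)
    memory = [rng.choice((-1, 1)) for _ in range(10000)]
    probe = random_probe(memory, 0.1, rng)
    print(hamming_errors(memory, probe))  # close to rho * n on average
```

With `rho = 0` the probe coincides with the memory, which corresponds to the stability (0-dominator) case.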

2.3  CODES 

For given integers $m \ge 1$, $n \ge 1$, a code, $\mathcal{K}_n^m$, is a collection of ordered multisets of size $m$ from $\mathbb{B}^n$. We say that an $m$-set of memories is admissible iff it is in $\mathcal{K}_n^m$.⁴ Thus, a code just specifies which $m$-sets are allowable as memories. Examples of codes include: the set of all multisets of size $m$ from $\mathbb{B}^n$; a single multiset of size $m$ from $\mathbb{B}^n$; all collections of $m$ mutually orthogonal vectors in $\mathbb{B}^n$; all $m$-sets of vectors in $\mathbb{B}^n$ in general position.

Define two ordered multisets of memories to be equivalent if they are permutations of one another. We define the size of a code, $\mathcal{K}_n^m$, to be the number of distinct equivalence classes of $m$-sets of memories. We will be interested in codes of relatively large size: $\log|\mathcal{K}_n^m|/n \to \infty$ as $n \to \infty$. In particular, we require at least an exponential number of choices of (equivalence classes of) admissible $m$-sets of memories.

³As usual, there are Liapunov functions for the system under suitable conditions on the interconnectivity graph and the corresponding weights.

⁴We define admissible $m$-sets of memories in terms of ordered multisets rather than sets so as to obviate certain technical nuisances.


2.4  CAPACITY 


For each fixed $n$ and interconnectivity graph, $G_n$, an algorithm, $X$, is a prescription which, given an $m$-set of memories, produces a corresponding set of interconnection weights, $w_{ij}$, $i \in [n]$, $\{i, j\} \in G_n$. For $m \ge 1$ let $\mathcal{A}(\mathbf{u}^1, \ldots, \mathbf{u}^m)$ be some attribute of $m$-sets of memories. (The following, for instance, are examples of attributes of admissible sets of memories: all the memories are stable in the network generated by $X$; almost all the memories dominate over a radius $\rho n$.) For given $n$ and $m$, we choose a random $m$-set of memories, $\mathbf{u}^1, \ldots, \mathbf{u}^m$, from the uniform distribution on $\mathcal{K}_n^m$.

Definition 2.2 Given interconnectivity graphs $G_n$, codes $\mathcal{K}_n^m$, and algorithm $X$, a sequence, $\{C_n\}_{n=1}^{\infty}$, is a capacity function for the attribute $\mathcal{A}$ (or $\mathcal{A}$-capacity for short) if for $\lambda > 0$ arbitrarily small:

a) $\mathbf{P}\{\mathcal{A}(\mathbf{u}^1, \ldots, \mathbf{u}^m)\} \to 1$ as $n \to \infty$ whenever $m \le (1 - \lambda)C_n$;

b) $\mathbf{P}\{\mathcal{A}(\mathbf{u}^1, \ldots, \mathbf{u}^m)\} \to 0$ as $n \to \infty$ whenever $m \ge (1 + \lambda)C_n$.

We also say that $C_n$ is a lower $\mathcal{A}$-capacity if property (a) holds, and that $C_n$ is an upper $\mathcal{A}$-capacity if property (b) holds.

For $m > 1$ let $u^1,\ldots,u^m \in \mathbb{B}^n$ be an $m$-set of memories chosen from a code $\mathcal{K}_n^m$.
The outer-product algorithm specifies the interconnection weights, $w_{ij}$, according
to the following rule: for $i \in [n]$, $\{i,j\} \in G_n$,

(2) 

In general, if the interconnectivity graph, $G_n$, is symmetric then, under a suitable
mode of operation, there is a Liapunov function for the network specified by the
outer-product algorithm. Given graphs $G_n$, codes $\mathcal{K}_n^m$, and the outer-product
algorithm, for fixed $0 < \rho < 1/2$ we are interested in the attribute $\mathcal{D}_\rho$ that each of the
$m$ memories dominates over a radius $\rho n$.

3  RANDOM  GRAPHS 

We investigate the effect of a random loss of neural interconnections in a recurrent
network of $n$ neurons by considering a random bipartite interconnectivity graph
$RG_n$ on vertices $[n] \times [n]$ with

$$P\{\{i,j\} \in RG_n\} = p$$

for all $i \in [n]$, $j \in [n]$, and with these probabilities being mutually independent.
The interconnection probability $p$ is called the sparsity parameter and may depend
on $n$. The system described above is formally equivalent to beginning with a
fully-interconnected network of neurons with specified interconnection weights $w_{ij}$, and
then invoking a devil which randomly severs interconnection links, independently
retaining each interconnection weight $w_{ij}$ with probability $p$, and severing it
(replacing it with a zero weight) with probability $q = 1 - p$.


Let $C\mathcal{K}_n^m$ denote the complete code of all choices of ordered multisets of size $m$ from
$\mathbb{B}^n$.


Theorem 3.1 Let $0 < \rho < 1/2$ be a fixed dominance radius, and let the sparsity
parameter $p$ satisfy $pn^2 \to \infty$ as $n \to \infty$. Then $(1-2\rho)^2 pn / 2\log pn^2$ is a
$\mathcal{D}_\rho$-capacity for random interconnectivity graphs $RG_n$, complete codes $C\mathcal{K}_n^m$, and the
outer-product algorithm.

Remarks: The above result graphically validates Folk Theorem 1 on the
fault-tolerant nature of the network; specifically, the network exhibits a graceful
degradation in storage capacity as the loss in interconnections increases. Catastrophic
failure occurs only when $p$ is smaller than $\log n / n$: each neuron need retain only of
the order of $\Omega(\log n)$ links of a total of $n$ possible links with other neurons for useful
associative properties to emerge.
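The capacity expression of Theorem 3.1 is easy to evaluate numerically. The following is a minimal sketch, not part of the original analysis: the function name `dominance_capacity` is hypothetical, the dominance radius is written `rho` to keep it apart from the sparsity parameter `p`, and the formula is asymptotic, so finite-$n$ values are only indicative.

```python
import math


def dominance_capacity(n: int, p: float, rho: float = 0.0) -> float:
    """Evaluate (1 - 2*rho)^2 * p*n / (2*log(p*n^2)) for finite n (illustrative only)."""
    if not 0.0 <= rho < 0.5:
        raise ValueError("rho must lie in [0, 1/2)")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    pn2 = p * n * n
    if pn2 <= 1.0:
        raise ValueError("need p*n^2 > 1 so that the logarithm is positive")
    return (1.0 - 2.0 * rho) ** 2 * p * n / (2.0 * math.log(pn2))


# Capacity degrades gracefully as links are lost (p decreases).
for p in (1.0, 0.5, 0.1):
    print(f"p = {p:.1f}: capacity expression = {dominance_capacity(10_000, p, rho=0.1):.1f}")
```

With `p = 1` and `rho = 0` the expression reduces to $n / 4\log n$, the fully-interconnected fixed point capacity quoted later in the text.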


4  BLOCK  GRAPHS 

One  of  the  simplest  (and  most  regular)  forms  of  sparsity  that  a  favourably  disposed 
devil  might  enjoin  is  block  sparsity  where  the  neurons  are  partitioned  into  disjoint 
subsets  of  neurons  with  full-interconnectivity  within  each  subset  and  no  neural 
interconnections  between  subsets.  The  weight  matrix  in  this  case  takes  on  a  block 
diagonal  form,  and  the  interconnectivity  graph  is  composed  of  a  set  of  disjoint, 
complete  bipartite  sub-graphs. 

More formally, let $1 \le b \le n$ be a positive integer, and let $\{I_1,\ldots,I_{n/b}\}$ partition
$[n]$ such that each subset of indices, $I_k$, $k = 1,\ldots,n/b$, has size $|I_k| = b$.^ We call
each $I_k$ a block and $b$ the block size. We specify the edges of the (bipartite) block
interconnectivity graph $BG_n$ by $\{i,j\} \in BG_n$ iff $i$ and $j$ lie in a common block.

Theorem 4.1 Let the block size $b$ be such that $b = \Omega(n)$ as $n \to \infty$, and let
$0 < \rho < 1/2$ be a fixed dominance radius. Then $(1-2\rho)^2 b / 2\log bn$ is a $\mathcal{D}_\rho$-capacity
for block interconnectivity graphs $BG_n$, complete codes $C\mathcal{K}_n^m$, and the outer-product
algorithm.

Corollary 4.2 Under the conditions of Theorem 4.1 the fixed point memory capacity
is $b / 2\log bn$.

Corollary 4.3 For a fully-interconnected graph, complete codes $C\mathcal{K}_n^m$, and the
outer-product algorithm, the fixed point memory capacity is $n / 4\log n$.

Corollary 4.3 is the main result shown by McEliece, Posner, Rodemich, and
Venkatesh (1987). Theorem 4.1 extends the result and shows (formally validating
the intuition espoused in Folk Theorem 2) that increased sparsity causes a loss
in capacity if the code is complete, i.e., all choices of memories are considered
admissible. It is possible, however, to design codes to take advantage of the sparse
interconnectivity structure, rather at odds with the Folk Theorem.
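As a quick consistency check between Theorem 4.1 and Corollaries 4.2 and 4.3, the block capacity expression can be evaluated directly. This is a sketch under the assumption that the expression reads $(1-2\rho)^2 b / 2\log bn$; the name `block_capacity` is hypothetical, and finite-$n$ values of an asymptotic formula are only indicative.

```python
import math


def block_capacity(n: int, b: int, rho: float = 0.0) -> float:
    """Evaluate (1 - 2*rho)^2 * b / (2*log(b*n)) for block size b (illustrative only)."""
    if not 1 <= b <= n:
        raise ValueError("block size must satisfy 1 <= b <= n")
    if b * n <= 1:
        raise ValueError("need b*n > 1 so that the logarithm is positive")
    if not 0.0 <= rho < 0.5:
        raise ValueError("rho must lie in [0, 1/2)")
    return (1.0 - 2.0 * rho) ** 2 * b / (2.0 * math.log(b * n))


n = 4096
print(block_capacity(n, n))        # one block of size n: equals n / (4 log n)
print(block_capacity(n, n // 4))   # four blocks: smaller capacity with a complete code
```

Setting `b = n` and `rho = 0` recovers $n / 4\log n$, and shrinking the block size lowers the value, which matches the remark that sparsity costs capacity when the code is complete.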

^Here,  as  in  the  rest  of  the  paper,  we  ignore  details  with  regard  to  integer  rounding. 


Without loss of generality let us assume that block $I_1$ consists of the first $b$ indices,
$[b]$, block $I_2$ the next $b$ indices, $[2b] - [b]$, and so on, with the last block consisting
of the last $b$ indices, $[n] - [n-b]$. We can then partition any vector $u \in \mathbb{B}^n$ as

$$u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n/b} \end{pmatrix} \qquad (3)$$

where for $k = 1,\ldots,n/b$, $u_k$ is the vector of components corresponding to block $I_k$.
For $M > 1$ we form the block code $B\mathcal{K}_n^m$ as follows: to each ordered multiset of
$M$ vectors, $u^1,\ldots,u^M$, from $\mathbb{B}^n$, we associate a unique ordered multiset in $B\mathcal{K}_n^m$
by lexicographically ordering all vectors of the form

$$\begin{pmatrix} u_1^{\alpha_1} \\ u_2^{\alpha_2} \\ \vdots \\ u_{n/b}^{\alpha_{n/b}} \end{pmatrix}, \qquad \alpha_1, \alpha_2, \ldots, \alpha_{n/b} \in [M].$$

Thus, we obtain an admissible set of memories from any ordered multiset
of $M$ vectors in $\mathbb{B}^n$ by "mixing" the blocks of the vectors. We call each $M$-set of
vectors, $u^1,\ldots,u^M \in \mathbb{B}^n$, the generating vectors for the corresponding admissible
set of memories in $B\mathcal{K}_n^m$.
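The block-mixing construction can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the names `split_blocks` and `block_code` are hypothetical, blocks are taken to be contiguous runs of `b` indices, and "lexicographic order" is interpreted as lexicographic order of the index tuple $(\alpha_1,\ldots,\alpha_{n/b})$.

```python
from itertools import product
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]


def split_blocks(u: Sequence[int], b: int) -> List[Vector]:
    """Cut a length-n vector into n/b contiguous blocks of size b."""
    if b <= 0 or len(u) % b != 0:
        raise ValueError("block size must be positive and divide the vector length")
    return [tuple(u[k:k + b]) for k in range(0, len(u), b)]


def block_code(generators: Sequence[Sequence[int]], b: int) -> List[Vector]:
    """Mix blocks of M generating vectors into M**(n/b) memories.

    Memory (a_1, ..., a_{n/b}) takes its k-th block from generator a_k.
    Index tuples are enumerated in lexicographic order (an assumption).
    """
    if not generators:
        raise ValueError("need at least one generating vector")
    n = len(generators[0])
    if any(len(u) != n for u in generators):
        raise ValueError("all generating vectors must have the same length")
    blocks = [split_blocks(u, b) for u in generators]
    num_blocks = n // b
    memories = []
    for alphas in product(range(len(generators)), repeat=num_blocks):
        memories.append(tuple(x for k, a in enumerate(alphas) for x in blocks[a][k]))
    return memories


# n = 4, b = 2, M = 2: two generating vectors give 2**(4/2) = 4 memories.
for memory in block_code([(1, 1, 1, 1), (-1, -1, -1, -1)], b=2):
    print(memory)
```

Because the block interconnectivity graph has no links between blocks, each block of a mixed memory sees only the corresponding blocks of the generating vectors, which is why the mixed vectors can inherit the stability of the generators.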


Example: Consider a case with $n = 4$, block size $b = 2$, and $M = 2$ generating
vectors. To any 2-set of generating vectors there corresponds a unique $4\,(= M^{n/b})$-set
in the block code as follows:


Theorem 4.4 Let $0 < \rho < 1/2$ be a fixed dominance radius. Then we have the
following capacity estimates for block interconnectivity graphs $BG_n$, block codes $B\mathcal{K}_n^m$,
and the outer-product algorithm:

a) If the block size $b$ satisfies $n\log\log bn / b\log bn \to 0$ as $n \to \infty$ then the
$\mathcal{D}_\rho$-capacity is

$$\left[\frac{(1-2\rho)^2\, b}{2\log bn}\right]^{n/b}.$$


b) Define for any $\nu$


If the block size $b$ satisfies $b/\log n \to \infty$ and $b\log bn / \log\log bn = O(n)$ as
$n \to \infty$, then $C_n(\nu)$ is a lower $\mathcal{D}_\rho$-capacity for any choice of $\nu < 3/2$ and
$C_n(\nu)$ is an upper $\mathcal{D}_\rho$-capacity for any $\nu > 3/2$.

Corollary 4.5 If, for fixed $t \ge 1$, we have $b = n/t$, then, under the conditions of
Theorem 4.4, the $\mathcal{D}_\rho$-capacity is

Corollary  4.6  For  any  fixed  dominance  radius  0  <  p  <  1/2,  and  for  any  r  <  I,  a 
constant  c  >  0  and  a  code  of  size  Q  ^2*"^  con  be  found  such  that  it  is  possible 
to  achieve  lower  "Dp-capacities  which  are  Q  (2"  )  in  recurrent  neural  networks  with 
interconnectivity  graphs  of  degree  0 

Remarks: If the number of blocks is kept fixed as $n$ grows (i.e., the block size
grows linearly with $n$) then capacities polynomial in $n$ are attained. If the number
of blocks increases with $n$ (i.e., the block size grows sub-linearly with $n$) then
super-polynomial capacities are attained. Furthermore, we have the surprising
result, rather at odds with Folk Theorem 2, that very large storage capacities can
be obtained at the expense of code size (while still retaining large code sizes) in
increasingly sparse networks.

Robustness in Neural Computation: Random
Graphs and Sparsity

Santosh S. Venkatesh, Member, IEEE

Abstract: Robustness is a commonly bruited property of neural networks;
in particular, a folk theorem in neural computation asserts that
fully-interconnected neural networks continue to function efficiently in
the presence of component damage. This communication is an effort to
mathematically codify this belief. Component damage is introduced in a
fully-interconnected neural network model of $n$ neurons by randomly
deleting links between neurons. An analysis of the outer-product
algorithm for this random graph model of sparse interconnectivity using a
simple generalisation of Chebyshev's inequality yields the following
main result: if the probability of losing any given link between two
neurons is $1 - p$, then the outer-product algorithm can store of the
order of $pn/\log pn^2$ stable memories correcting a linear number of
random errors. In particular, the average degree of the interconnectivity
graph dictates the memory storage capability, and functional storage of
memories as stable states is feasible abruptly when the average number
of neural interconnections retained by a neuron exceeds the order of
$\log n$ links (of a total of $n$ possible links) with other neurons. This work
complements the results of Komlós and Paturi on worst case error
correction for fixed underlying interconnectivity graphs.

Index  Terms— Neural  networks,  robustness,  random  graph,  sparsity, 
outer-product  algorithm. 


I.  Introduction 

A.  The  Problem 

Robustness in the presence of component damage is a property
that is commonly attributed to neural networks. The content of the
following statement embodies this sentiment.

Folk Theorem: Computation in neural networks is not substantially
affected by damage to network components.

While such a statement cannot hold true in general (witness networks
with "grandmother cells" where damage to the critical cells
fatally impairs the computational ability of the network), there is
anecdotal evidence in support of it in situations where the network
has a more "distributed" flavor with a relatively dense interconnectivity
of elements. In such situations, experimental evidence indicates
that networks of neural elements do indeed possess a measure
of fault-tolerance [1]. Qualitatively, the phenomenon is akin to
holographic modes of storing information where the distributed,
nonlocalized format of information storage carries with it a measure
of security against component damage.

Neural models for associative memory are natural candidates for
investigation of fault-tolerant properties. These models typically
consist of a fully-interconnected network of formal neurons (linear
threshold elements). Information is stored in these models in the
interconnections between neural elements.

The outer-product algorithm, which we describe in the sequel,
is a particularly simple algorithmic prescription for storing memories
in a fully-interconnected, recurrent neural network. The algorithm
has good associative properties, and consequently, has been
the subject of some searching mathematical investigations: McEliece
et al. [2] showed that the algorithm can store of the order of
$n/\log n$ memories with correction of a linear number of random
errors; subsequent investigations by Komlós and Paturi [3] showed
that the storage capacities derived by McEliece et al. persist even in
the case of worst case errors; complementary results due to Newman
[4] indicate that storage capacities linear in $n$ can be achieved
in the outer-product algorithm if errors can be tolerated in the recall
of the memories. Nonrigorous results qualitatively similar to those
above have also been reported by Hopfield [1] and Amit et al. [5].

We investigate robustness in the model by invoking a devil (well,
maybe an imp) which randomly severs interconnection links in a
fully interconnected network of $n$ neurons, with weights specified
by the outer-product algorithm. The sparse network that results is
essentially specified by an underlying random interconnectivity
graph. The following are our main results, which provide a graphic
validation of the folk theorem in this instance.

If the probability of retaining any given link between two neurons
is $p$, then the outer-product algorithm can store of the order of
$pn/\log pn^2$ stable memories with correction of a linear number of
random errors. Functional storage of memories as stable states is
feasible when the average degree of the random interconnectivity
graph exceeds the order of $\log n$; memories will be stable with
respect to a linear number of random errors in components if the
average degree of the random interconnectivity graph exceeds the
order of $\log^3 n$.

These results are consistent with results of Komlós and Paturi [6]
who  have  analysed  worst  case  errors  in  networks  with  interconnec¬ 
tivities  specified  by  fixed  underlying  graphs.  Using  sophisticated 
and  powerful  techniques  from  large  deviation  probability  theory 
they  show  results  on  convergence  times  and  the  radius  of  attraction 
within  which  all  points  are  attracted  to  the  memories  in  terms  of  the 
spectrum  of  the  underlying  graph.  The  random  graph  model 
analysed  here  provides  great  attendant  simplicity  in  the  analysis  of 
the  correction  of  random  errors.  In  fact,  as  we  will  see  in  the 
sequel,  the  main  results  fall  out  of  a  rather  simple  application  of 
Chebyshev’s  inequality. 

Notation: We denote by $\mathbb{B}$ the set $\{-1, 1\}$. For any positive
integer $k$, we denote by $[k]$ the set $\{1,\ldots,k\}$. All logarithms in
the exposition are to the Napier base $e$. We also use $c_1, c_2, \ldots$, to
denote absolute positive constants. We invoke standard asymptotic
notation in the sequel; in addition, if $\{x_n\}$ is a positive sequence
and $\{y_n(t)\}$ is another positive sequence which is a function of a
real parameter $t$, we denote $y_n(t) = O_t(x_n)$ if, for every fixed
value of $t$, we can find $K(t) > 0$ (independent of $n$) such that
$y_n(t) \le K(t)x_n$ for every $n$.

B.  The  Setting 

We consider a network of $n$ formal neurons. Each neuron in the
system assumes one of two binary states, $-1$ or $+1$, and the
network as a whole evolves in the state space, $\mathbb{B}^n$, of binary vectors
of length $n$. Neural interconnectivity is specified by a random
bipartite interconnectivity graph $G_n$ on vertices $[n] \times [n]$ with

$$P\{\{i,j\} \in G_n\} = p,$$

for all $i \in [n]$, $j \in [n]$, and with these probabilities being mutually
independent. The interconnection probability $p$ is called the sparsity
parameter and may depend on $n$. A real interconnection
weight $w_{ij}$ is associated with each edge $\{i,j\} \in G_n$. We adopt the
convention $w_{ij} = 0$ if $\{i,j\} \notin G_n$. The state of each neuron is
updated based on the sign of a linear form computed by the
interconnection weights and the current state of the system: if
$u \in \mathbb{B}^n$ is the current state of the system, an update, $u_i \mapsto u_i'$, of the
state of the $i$th neuron is specified by the linear threshold rule¹

$$u_i' = \mathrm{sgn}\Big(\sum_{j:\{i,j\} \in G_n} w_{ij}u_j\Big) = \mathrm{sgn}\Big(\sum_{j=1}^{n} w_{ij}u_j\Big).$$

Neural updates may be either synchronous, with every neuron being
updated in concert, or asynchronous, with at most one neuron being
updated at any instant.
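The two update modes can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the weight matrix is a plain list of lists with zeros standing in for severed links, and the convention `sgn(0) = +1` is assumed from the accompanying footnote.

```python
from typing import List, Sequence


def sgn(x: float) -> int:
    """Sign function with the (assumed) convention sgn(0) = +1."""
    return 1 if x >= 0 else -1


def synchronous_step(w: Sequence[Sequence[float]], u: Sequence[int]) -> List[int]:
    """Update every neuron in concert from the same current state u."""
    n = len(u)
    return [sgn(sum(w[i][j] * u[j] for j in range(n))) for i in range(n)]


def asynchronous_step(w: Sequence[Sequence[float]], u: Sequence[int], i: int) -> List[int]:
    """Update only neuron i, leaving all other components of the state unchanged."""
    v = list(u)
    v[i] = sgn(sum(w[i][j] * u[j] for j in range(len(u))))
    return v


w = [[0, 1], [1, 0]]
print(synchronous_step(w, [1, -1]))      # both neurons read the old state
print(asynchronous_step(w, [1, -1], 0))  # only neuron 0 is updated
```

The difference matters for dynamics: in the synchronous mode every neuron reads the same snapshot of the state, whereas a sequence of asynchronous steps lets later updates see earlier ones.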

The system previously described is formally equivalent to beginning
with a fully-interconnected network of neurons with specified
interconnection weights $w_{ij}$, and then invoking a devil which randomly
severs interconnection links, independently retaining each
interconnection weight $w_{ij}$ with probability $p$, and severing it
(replacing it with a zero weight) with probability $q = 1 - p$. Note
that the expected number of weights retained by any neuron in the
network is $pn$, and the expected number of nonzero weights in the
network is $pn^2$.²

C.  The  Algorithm 

As  in  any  recurrent  dynamical  system,  we  are  interested  in  the 
fixed  points  of  the  system.  In  particular,  we  focus  on  an  associative 
memory  application  where  we  wish  to  store  a  desired  set  of 
states— the  fundamental  memories— as  fixed  points  of  the  net¬ 
work,  and  with  the  property  that  errors  in  an  input  representation  of 
a  memory  are  corrected  and  the  memory  retrieved. 

Let $u^1,\ldots,u^m \in \mathbb{B}^n$ be an $m$-set of fundamental memories
whose components, $u_j^\beta$, are drawn independently from a sequence
of symmetric Bernoulli trials; viz., for $j = 1,\ldots,n$, and $\beta = 1,\ldots,m$,

$$u_j^\beta = \begin{cases} -1, & \text{with probability } 1/2, \\ +1, & \text{with probability } 1/2. \end{cases}$$

The outer-product algorithm specifies interconnection weights, $\hat{w}_{ij}$,
according to the following prescription:³ for $i \in [n]$, $j \in [n]$,

$$\hat{w}_{ij} = \begin{cases} \sum_{\beta=1}^{m} u_i^\beta u_j^\beta, & \text{if } j \ne i, \\ 0, & \text{if } j = i. \end{cases}$$

In our sparse interconnectivity model, each weight $\hat{w}_{ij}$ is independently
severed with probability $q = 1 - p$, and retained with probability
$p$. More formally, let $\pi_{ij}$, $i \in [n]$, $j \in [n]$ be a sequence of
i.i.d. random variables with


¹ We define the sgn function by $\mathrm{sgn}\,x = x/|x|$ for all $x \ne 0$ and
$\mathrm{sgn}\,0 = 1$.

² We could, if we wished, enforce symmetry in the sparse network by
considering the links between neurons as bidirectional so that severing a link
automatically produces symmetric zeroes in the weight matrix. For the
purposes of this correspondence it is immaterial which random graph model
we select.

³ Variations are possible with diagonal terms $w_{ii} \ne 0$, but are all
functionally equivalent.


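$$\pi_{ij} = \begin{cases} 0, & \text{if } \{i,j\} \notin G_n, \\ 1, & \text{if } \{i,j\} \in G_n. \end{cases}$$

For $i \in [n]$ and $j \in [n]$ we can now define the interconnection
weights of the sparse, random network by

$$w_{ij} = \begin{cases} \pi_{ij}\sum_{\beta=1}^{m} u_i^\beta u_j^\beta, & \text{if } j \ne i, \\ 0, & \text{if } j = i. \end{cases} \qquad (1)$$

The variables $\pi_{ij}$ are simply the indicator random variables for the
edges of the random bipartite interconnectivity graph $G_n$.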
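The sparse outer-product prescription can be simulated directly. The sketch below is illustrative only: all function names are hypothetical, each off-diagonal link is retained independently with probability `p` (so the resulting matrix need not be symmetric), and `sgn(0) = +1` is assumed.

```python
import random
from typing import List

Memory = List[int]


def random_memories(m: int, n: int, rng: random.Random) -> List[Memory]:
    """Draw m memories in {-1, +1}^n with i.i.d. symmetric components."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]


def sparse_outer_product(memories: List[Memory], p: float, rng: random.Random) -> List[List[int]]:
    """Outer-product weights with each off-diagonal link kept independently with probability p."""
    n = len(memories[0])
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j == i:
                continue  # diagonal weights are zero
            if rng.random() < p:  # the link {i, j} survives
                w[i][j] = sum(u[i] * u[j] for u in memories)
    return w


def is_fixed_point(w: List[List[int]], u: Memory) -> bool:
    """True when one synchronous update leaves u unchanged (sgn(0) taken as +1)."""
    n = len(u)
    for i in range(n):
        field = sum(w[i][j] * u[j] for j in range(n))
        if (1 if field >= 0 else -1) != u[i]:
            return False
    return True


def fraction_stable(n: int, m: int, p: float, seed: int = 0) -> float:
    """Fraction of m random memories that are fixed points of one sparse network."""
    rng = random.Random(seed)
    memories = random_memories(m, n, rng)
    w = sparse_outer_product(memories, p, rng)
    return sum(is_fixed_point(w, u) for u in memories) / m


print(fraction_stable(n=200, m=5, p=0.5))
```

Running `fraction_stable` over a range of `m` for fixed `n` and `p` gives an empirical feel for where stability breaks down, though small simulations should not be expected to match the asymptotic capacity estimates closely.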


II. Stable Memories

A basic requirement that we would like to impose is that the
memories are stable, i.e., fixed points of the network:

$$u_i^\alpha = \mathrm{sgn}\Big(\sum_{j=1}^{n} w_{ij}u_j^\alpha\Big), \qquad i = 1,\ldots,n, \quad \alpha = 1,\ldots,m.$$

We begin by estimating the number of memories that can be made
stable in the outer-product algorithm for a random interconnectivity
graph with sparsity parameter $p$. The following theorem is our main
result.


Theorem 1: Let the sparsity parameter $p$ satisfy $pn^2 \to \infty$ as
$n \to \infty$. For any fixed $\epsilon > 0$, we then have the following.

a) If, as $n \to \infty$, we choose the number of memories, $m$, such
that

$$m \le \frac{pn}{2\log pn^2}\left[1 + \frac{\log\log pn^2 + \log 2\epsilon}{\log pn^2}\right], \qquad (2)$$

then the probability that all $m$ memories are fixed points is at
least as large as $1 - \epsilon - o(1)$.

b) If, as $n \to \infty$, we choose the number of memories, $m$, such
that

$$m \le \frac{pn}{2\log n}\left[1 + \frac{\log\epsilon}{\log n} + O\left(\frac{1}{\log^2 n}\right)\right], \qquad (3)$$

then the expected number of memories that are fixed points is
at least as large as $[1 - \epsilon - o(1)]m$.


Remarks: In particular, we can store at least $pn/2\log pn^2$
memories if all the memories are required to be stable, and at least
$pn/2\log n$ memories if only most of the memories are required to
be stable. This result reduces to the capacity result for full
interconnectivity of McEliece et al. [2] if we set $p = 1$, i.e., no
interconnections are severed.

This result illustrates graphically the fault-tolerant nature of the
network; specifically, the network exhibits a graceful degradation
in storage capacity as the loss in interconnections increases. Memory
storage is achieved if the sparsity parameter, $p$, is at least of the
order of $\log n/n$, i.e., each neuron retains essentially of the order
of $\log n$ weights out of its original complement of $n$ weights. In
particular, if $p = K\log n/n$ then the network can store at least
$K/2$ memories; if $p = n^{-\tau}$ for any $0 \le \tau < 1$ then the network
can store at least $n^{1-\tau}/2(2-\tau)\log n$ memories; if $p$ is equal to a
constant $0 < c \le 1$ then the network can store at least $cn/4\log n$
memories. As a graphic example, the network can lose half its
interconnections with essentially no change in the storage characteristics.
If $p = o(\log n/n)$, i.e., the average degree of the interconnectivity
graph is $o(\log n)$, we should, of course, expect catastrophic
failure of the memory.

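Proof: Let us define the doubly indexed random variables,
$X_i^\alpha$, by

$$X_i^\alpha = u_i^\alpha\sum_{j=1}^{n} w_{ij}u_j^\alpha, \qquad i = 1,\ldots,n, \quad \alpha = 1,\ldots,m.$$

It is readily seen that $X_i^\alpha > 0$ implies that the $i$th component of the
$\alpha$th memory is stable. Thus, we will require that $X_i^\alpha > 0$ for each
$i \in [n]$ and $\alpha \in [m]$ if each of the memories is to be a fixed point of
the network.

Let us first consider the requirements that must be satisfied for a
single component of a memory to be fixed. Substituting for the
weights, $w_{ij}$, from (1) we have

$$X_i^\alpha = \sum_{j \ne i}\sum_{\beta=1}^{m}\pi_{ij}u_i^\alpha u_i^\beta u_j^\beta u_j^\alpha = \sum_{j \ne i}\pi_{ij}\Big[1 + \sum_{\beta \ne \alpha} Z_j^\beta\Big],$$

where we define

$$Z_j^\beta = u_i^\alpha u_i^\beta u_j^\beta u_j^\alpha, \qquad j \ne i, \quad \beta \ne \alpha.$$

We hold the indices $i$ and $\alpha$ fixed for the nonce, and for notational
simplicity suppress the $i$ and $\alpha$ dependence of both $X_i^\alpha$ and $Z_j^\beta$.
We will need the following result which estimates the probability
that a single component of a memory is not stable.

Lemma 1: If, as $n \to \infty$, the parameters $p$ and $m$ vary such that
$p\sqrt{n}/m \to 0$, then

$$P\{X \le 0\} \le [1 + o(1)]\exp\Big\{-\frac{pn}{2m}\Big\} \qquad (n \to \infty).$$

Proof: For fixed $i$ and $\alpha$, the random variables $Z_j^\beta$ are i.i.d.
and symmetric, and take on the values $-1$ and $1$ with equal
probability $1/2$. (This follows from the fact that the memory
components are i.i.d., symmetric $\pm 1$ random variables, and that the
distinct component $u_j^\beta$ appears solely in the expression for $Z_j^\beta$.)
Applying the generalized Chebyshev inequality of Lemma A1, we
have the following estimate for the probability that there is an error
in the retrieval of a single component of a memory:

$$P\{X \le 0\} \le \inf_{r \ge 0} E\,e^{-rX} = \inf_{r \ge 0} E\exp\Big\{-r\sum_{j \ne i}\pi_{ij}\Big[1 + \sum_{\beta \ne \alpha} Z_j^\beta\Big]\Big\} = \inf_{r \ge 0} E\prod_{j \ne i}\exp\Big\{-r\pi_{ij}\Big[1 + \sum_{\beta \ne \alpha} Z_j^\beta\Big]\Big\}.$$

The (1,0) random variables, $\pi_{ij}$, $j \ne i$, are i.i.d., as are the $\pm 1$
random variables $Z_j^\beta$, $j \ne i$, $\beta \ne \alpha$. It follows that the terms in the
product are also i.i.d. random variables. For notational simplicity,
we set $M = m - 1$ and $N = n - 1$. We now have

$$P\{X \le 0\} \le \inf_{r \ge 0}\Big[E\exp\Big\{-r\pi_{ij}\Big(1 + \sum_{\beta \ne \alpha} Z_j^\beta\Big)\Big\}\Big]^N = \inf_{r \ge 0}\big[pe^{-r}\big(E\,e^{-rZ}\big)^M + q\big]^N = \inf_{r \ge 0}\big[pe^{-r}\cosh^M r + q\big]^N.$$

Now, for every real $r$ we have $\cosh r \le e^{r^2/2}$. Hence,

$$P\{X \le 0\} \le \inf_{r \ge 0}\big[pe^{-r + Mr^2/2} + q\big]^N = \big[1 - p\big(1 - e^{-1/2M}\big)\big]^N. \qquad (4)$$

Recalling that $M = m - 1$ it is easy to verify that for $m > 1$

$$1 - e^{-1/2M} = \frac{1}{2m} + \Theta\Big(\frac{1}{m^2}\Big) \ge \frac{1}{2m}.$$

It follows that

$$P\{X \le 0\} \le \Big[1 - \frac{p}{2m}\Big]^N = \exp\Big\{N\log\Big(1 - \frac{p}{2m}\Big)\Big\} = \exp\Big\{-\frac{pn}{2m} + O\Big(\frac{p}{m}\Big) - O\Big(\frac{p^2 n}{m^2}\Big)\Big\}.$$

To obtain the last equality we have used the Taylor series approximation

$$\log(1 - x) = -x - O(x^2) \qquad (x \to 0)$$

and recalled that $N = n - 1$. The condition on $p$ and $m$ completes
the proof. □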
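The chain of bounds in the proof of Lemma 1 can be checked numerically. The sketch below assumes the reading of the proof in which the moment generating function bound is $[pe^{-r}\cosh^M r + q]^N$ and the closed form (4) is $[1 - p(1 - e^{-1/2M})]^N$; the function names are hypothetical and the parameter values are arbitrary.

```python
import math


def mgf_bound(r: float, p: float, m: int, n: int) -> float:
    """[p * e^{-r} * cosh(r)^M + q]^N with M = m - 1, N = n - 1, q = 1 - p."""
    if m < 2 or n < 2:
        raise ValueError("need m >= 2 and n >= 2")
    big_m, big_n = m - 1, n - 1
    return (p * math.exp(-r) * math.cosh(r) ** big_m + (1.0 - p)) ** big_n


def closed_form_bound(p: float, m: int, n: int) -> float:
    """[1 - p * (1 - e^{-1/(2M)})]^N, obtained via cosh r <= e^{r^2/2} at r = 1/M."""
    if m < 2 or n < 2:
        raise ValueError("need m >= 2 and n >= 2")
    big_m, big_n = m - 1, n - 1
    return (1.0 - p * (1.0 - math.exp(-1.0 / (2.0 * big_m)))) ** big_n


def asymptotic_bound(p: float, m: int, n: int) -> float:
    """exp(-p*n / (2*m)), the leading-order form stated in Lemma 1."""
    return math.exp(-p * n / (2.0 * m))


p, m, n = 0.3, 20, 500
print(mgf_bound(1.0 / (m - 1), p, m, n))  # MGF bound evaluated at r = 1/M
print(closed_form_bound(p, m, n))         # closed form (4)
print(asymptotic_bound(p, m, n))          # leading-order estimate
```

The first value cannot exceed the second, since replacing $\cosh r$ by $e^{r^2/2}$ only loosens the bound; the third is the asymptotic form and is close to the second once $p\sqrt{n}/m$ is small.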

The probability that one or more components of any of the
memories is not stable can be readily estimated by an application of
the union bound and (4):

$$P\{\text{some memory is not a fixed point}\} \le nmP\{X \le 0\} \le nm\big[pe^{-1/2M} + q\big]^{N}.$$

Note that the bound of (4) holds for all choices of $p$, $m$, and $n$, so
that the above estimate also holds unrestricted. It is clear
that the upper bound increases monotonically as $m$ increases,
so it suffices to prove the theorem with inequality replaced
by equality in (2) and (3). Now, with $m$ chosen as in (2) the
condition on $p$ and $m$ in Lemma 1 is satisfied. Hence, for this
choice of $m$

$$P\{\text{some memory is not a fixed point}\} \le [1 + o(1)]\,nm\exp\Big\{-\frac{pn}{2m}\Big\} \le \epsilon + o(1) \qquad (n \to \infty).$$

This establishes part a) of the theorem.

In similar fashion we can establish the second part of the theorem
by noting that the probability that a given memory is not a fixed
point is bounded from above by $nP\{X \le 0\}$ by the union bound.
For a choice of $m$ according to (3) this probability is bounded above
by $\epsilon + o(1)$. Part b) of the theorem follows as the expected number
of memories that are not fixed points is just $m$ times the probability
that one memory is not fixed.
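The last inequality in part a) can be verified by a short computation. The following worked check is a sketch: the shorthand $L$ and $x$ is introduced here for convenience and does not appear in the original, and $m$ is taken with equality in (2), read as $m = \frac{pn}{2\log pn^2}\big[1 + \frac{\log\log pn^2 + \log 2\epsilon}{\log pn^2}\big]$.

```latex
% Shorthand introduced for this check only:
%   L = \log pn^2,   x = (\log L + \log 2\epsilon)/L,   so that   m = pn(1+x)/(2L).
% Note x \to 0 as n \to \infty because L \to \infty.
\begin{align*}
\frac{pn}{2m} &= \frac{L}{1+x} \;\ge\; L(1-x) \;=\; L - \log L - \log 2\epsilon
  && \text{(since } (1-x)(1+x) \le 1 \text{ and } 1+x > 0\text{)}, \\
\exp\Big\{-\frac{pn}{2m}\Big\} &\le e^{-L}\cdot L \cdot 2\epsilon \;=\; \frac{2\epsilon L}{pn^2}, \\
nm &= \frac{pn^2(1+x)}{2L}, \\
nm\exp\Big\{-\frac{pn}{2m}\Big\} &\le \frac{pn^2(1+x)}{2L}\cdot\frac{2\epsilon L}{pn^2}
  \;=\; \epsilon(1+x) \;=\; \epsilon + o(1).
\end{align*}
```

Multiplying by the factor $[1 + o(1)]$ from Lemma 1 leaves the bound at $\epsilon + o(1)$, which is the estimate used to conclude part a).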

III. Error Correction

Let  us  now  investigate  how  sparsity  in  the  model  affects  the 
ability  of  the  system  to  retrieve  fundamental  memories  from  probes 
which  are  "noisy"  versions  of  the  memories.  The  particular  model 
of  error  correction  that  we  will  investigate  is  the  ability  of  the 
(sparse)  network  to  correct  random  errors  in  the  memories  in  one 
synchronous  step.  As  we  will  see,  the  moment  inequality  technique 
of  the  previous  section  still  serves  to  analyse  this  situation,  albeit  at 
the  cost  of  some  additional  complexity. 

Let $0 < \rho < 1/2$ be fixed. Corresponding to each memory $u^\alpha$,
we generate a random probe, $\tilde{u}^\alpha \in \mathbb{B}^n$, by independently specifying
the components, $\tilde{u}_j^\alpha$, of the probe as follows:

$$\tilde{u}_j^\alpha = \begin{cases} u_j^\alpha, & \text{with probability } 1 - \rho, \\ -u_j^\alpha, & \text{with probability } \rho. \end{cases} \qquad (5)$$

Note that the expected number of errors in the probe (i.e., the
expected number of components of the probe, $\tilde{u}^\alpha$, which are not
equal to the corresponding components of the memory, $u^\alpha$) is $\rho n$.
Definition 1: We say that a memory, $u^\alpha$, dominates over a
radius $\rho n$ if, with probability approaching one as $n \to \infty$, the
network corrects all the errors in a random probe generated according
to prescription (5) in one synchronous step. We call $\rho$ the
(fractional) radius of dominance of the memory.

Remarks: An application of Lemma A2 in the appendix yields
that for any $\delta > 0$ there is a large enough constant $C$ such that with
probability $1 - \delta$ the number of errors in the probe lies between
$\rho n - C\sqrt{n}$ and $\rho n + C\sqrt{n}$. Hence, a memory that dominates
over a radius $\rho n$ corrects random errors in essentially $\rho n$
components with high probability.

An alternative (and perhaps more appealing) model for generating
random probes is to choose the probe at random from the
Hamming ball of radius $\rho n$ surrounding the memory. The notion of
a radius of dominance for the memory is intuitively and geometrically
much clearer for this model. However, by the sphere hardening
effect, almost all probes generated in this model are concentrated
at the surface of the Hamming ball so that the number of
errors is again essentially $\rho n$. The analytical results that derive for
this model are formally indistinguishable from the model we have
adopted in (5). The present format is, however, slightly more
convenient mathematically.

We  will  prove  the  following  theorem  which  is  our  main  result  of 
this  section. 

Theorem 2: Let $0 \le \rho < 1/2$ be any desired radius of dominance,
and let the sparsity parameter, $p$, satisfy $p = \Omega(\log^\gamma n/n)$
for some fixed $\gamma > 3$. For any $\epsilon > 0$, we then have the following.

a) If, as $n \to \infty$, we choose the number of memories, $m$, such
that

$$m \le \frac{(1-2\rho)^2 pn}{2\log pn^2}\left[1 + \frac{\log\log pn^2 + \log 2\epsilon/(1-2\rho)^2}{\log pn^2} + O\left(\frac{\log\log pn^2}{\log^2 pn^2}\right)\right], \qquad (6)$$

then the probability that all $m$ memories dominate over a
radius $\rho n$ is at least as large as $1 - \epsilon - o(1)$.

b) If, as $n \to \infty$, we choose the number of memories, $m$, such
that

$$m \le \frac{(1-2\rho)^2 pn}{2\log n}\left[1 + \frac{\log\epsilon}{\log n} + O\left(\frac{1}{(\log n)^2}\right)\right], \qquad (7)$$

then the expected number of memories that dominate over a
radius $\rho n$ is at least as large as $[1 - \epsilon - o(1)]m$.

Remarks: We can store at least $(1-2\rho)^2 pn/2\log pn^2$ memories
all of which dominate over a radius $\rho n$, and at least
$(1-2\rho)^2 pn/2\log n$ memories most of which dominate over a radius
$\rho n$. These lower estimates of capacity are also tight from above.
This can be demonstrated by extending the technique used by McEliece
et al. [2]. The proof, as in the original, is long and replete with
technical details. We will not go into it here.

Proof: We will first estimate the probability that a single
component of a memory is retrieved from a random probe. The use
of the union bound, as before, will then complete the proof of the
theorem.

Let us form the random sums

$$X_i^\alpha = u_i^\alpha\sum_{j=1}^{n} w_{ij}\tilde{u}_j^\alpha, \qquad i = 1,\ldots,n, \quad \alpha = 1,\ldots,m. \qquad (8)$$

If random errors are to be corrected in one synchronous step for
each memory we will require that $X_i^\alpha > 0$ for each $i \in [n]$ and
$\alpha \in [m]$ with high probability. Let us first estimate the probability
that a particular component of a memory is not retrieved in one
synchronous step from a random probe. We again hold $i$ and $\alpha$
fixed and suppress the dependence of variables on these indices
except where required for clarity.

Substituting for the weights, $w_{ij}$, from (1) in (8) we have

$$X = \sum_{j \ne i}\sum_{\beta=1}^{m}\pi_{ij}u_i^\alpha u_i^\beta u_j^\beta\tilde{u}_j^\alpha = Y + \sum_{j \ne i}\pi_{ij}\sum_{\beta \ne \alpha} Z_j^\beta, \qquad (9)$$

where we define

$$Z_j^\beta = u_i^\alpha u_i^\beta u_j^\beta\tilde{u}_j^\alpha, \qquad j \ne i, \quad \beta \ne \alpha,$$

and

$$Y = \sum_{j \ne i}\pi_{ij}u_j^\alpha\tilde{u}_j^\alpha. \qquad (10)$$

We are interested in estimating the probability that $X \le 0$, i.e., the
probability that the $i$th component of memory $u^\alpha$ is not retrieved
from the random probe $\tilde{u}^\alpha$ in one synchronous step. The following
is the central result.

Lemma 2: Let $0 \le \rho < 1/2$ be any desired fractional radius of
dominance, and let $\tau$ be a fixed parameter with $2/3 < \tau < 1$. If, as
$n \to \infty$, the sparsity parameter, $p$, and the number of memories,
$m$, vary such that $pn \to \infty$ and $m = \Omega((pn)^\tau)$ then

$$P\{X \le 0\} \le [1 + o(1)]\exp\Big\{-\frac{(1-2\rho)^2 pn}{2m}\Big\} \qquad (n \to \infty). \qquad (11)$$

Proof: The demonstration is in three parts. We first show that
the sum over the index $j$ in (9) can be formally replaced by a sum
over essentially $pn$ indices; we next show that the random variable
$Y$ can be formally replaced by the fixed value $(1-2\rho)pn$; we
finally invoke the inequality involving the moment generating function
described in the previous section to complete the proof.

Let $J \subseteq [n] \setminus \{i\}$ be the random subset of indices defined by

$$J = \{j : \pi_{ij} = 1\}.$$

We then have

$$X = Y + \sum_{j \in J}\sum_{\beta \ne \alpha} Z_j^\beta. \qquad (12)$$


Let the random variable $\Lambda = |J|$ denote the cardinality of $J$.
Clearly, $\Lambda = \sum_{j \ne i} \pi_{ij}$. It follows that

$$E(\Lambda) = pN,$$

where we set $N = n - 1$ as before. Let $\delta$ be chosen such that
$(1 - \tau)/2 < \delta < 1/6$. An application of Lemma A2 yields

$P\{|\Lambda - pN| > (pN)^{1/2+\delta}\}$ =  (13)

Now,  from  (10)  we  have 

jfj 

By  independence  of  the  components  of  the  memories,  the  expecta¬ 
tion  of  Y  conditioned  upon  a  sample  realisation  of  the  random  set  of 


1118 


IEEE  TRANSACTIONS  ON  INFORMATION  THEORY,  VOL.  38.  NO  3.  MAY  1992 


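indices $J$ depends only on the cardinality, $\Lambda$, of $J$. Hence,

$$E\,Y = \sum_{k=0}^{N} E\{Y \mid \Lambda = k\}\, P\{\Lambda = k\} = \sum_{k=0}^{N} (1 - 2\rho)\, k \binom{N}{k} p^k (1 - p)^{N-k} = (1 - 2\rho)pN.$$

Using (13) and the large deviation Lemma A2 hence yields

$P\{|Y - (1 - 2\rho)pN| > (pN)^{1/2+\delta}\}$ =  (14)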

Let $\mathcal{S}$ be the set of sample points over which the following
inequalities hold jointly:

$$|\Lambda - pN| \le (pN)^{1/2+\delta},$$

$$|Y - (1 - 2\rho)pN| \le (pN)^{1/2+\delta}.$$

From (13) and (14), we then have

$P\{\mathcal{S}\}$ = 1 -  (15)

We say that an assignment of values to $\Lambda$ and $Y$ is allowable if
they occur in $\mathcal{S}$. A subset of indices from $[n] \setminus \{i\}$ is
allowable if the number of indices in the set is allowable.

Let us now return to a consideration of (12). Using (15) we have
from elementary considerations that

/*{ ^ s  0}  = /»( i;  E  z/s  -y) 

jeJ  P* a  J 

=  ^(  E  E  s  -  y  yj  (i6) 

[j^e*a  I 

Let $J' \subseteq [n] \setminus \{i\}$ be any subset of indices, and let $\lambda = |J'|$.
For positive $\lambda$ and $y$ define

/(>‘.3’)  =  We 

Applying  Lemma  A1  as  in  the  last  section,  we  have 

/(X,  y)  s  inf  s  n>M 

rtO 

Now consider a choice of $\lambda = pN \pm O((pN)^{1/2+\delta})$ and
$y = (1 - 2\rho)pN \pm O((pN)^{1/2+\delta})$. Recalling that $N = n - 1$,
$M = m - 1$, and that from the statement of the lemma $pn \to \infty$ and
$m = \Omega((pn)^\tau)$ for $2/3 < \tau < 1$, we have


(17) 

The last equality follows from the choice $(1 - \tau)/2 < \delta < 1/6$; this
yields $1/2 + \delta < 2/3 < \tau$ so that by choice of $m = \Omega((pn)^\tau)$ we
have $(pn)^{1/2+\delta} = o(m)$.

Returning to (16) we note that the random variables $Z_j^\beta$ are
independent of the random variable $Y$ and the random subsets $J$.


Hence,  we  have 

p{i^so}=  E  p\EEzJ-&-y  Y  =  y, 

allowable  >,  y' 

y  =  y',y  jp{y  =  y,y  =  y'l-5"} 

=  E  p\EEzJ^-y\ 

•P{Y  =  y,  J  =  r\y)  +  0(e-‘^3(/»o“) 

E  f{Ky)P{Y  =  y,j  =  j-\y} 

allowable  y.  J‘ 

+  0(^-r,(p,)“)_ 

where $\lambda = |J'|$. For allowable $\lambda$ and $y$, however, we have

$$|\lambda - pN| = O((pN)^{1/2+\delta}),$$

$$|y - (1 - 2\rho)pN| = O((pN)^{1/2+\delta}),$$

by definition. The bound (17), hence, holds for every term, $f(\lambda, y)$,
in the sum above. It follows that


The exponent $(pn)^{2\delta}$ dominates $pn/m$ as $m = \Omega((pn)^\tau)$ and
$2\delta > 1 - \tau$. Further, $(pn)^{1/2+\delta}/m = o(1)$ as $1/2 + \delta < \tau$. The
statement of the lemma follows.

As before, the probability that one or more memory components
is not retrieved increases monotonically as $m$ increases, so it
suffices to show that the theorem holds with $m$ given by equality in
(6) and (7). Now let $\gamma > 3$ be as in the statement of the theorem,
and set $\tau = 1 - 1/\gamma$. A choice of a number of memories according
to (6) or (7) satisfies the requirements of Lemma 2, so that the
asymptotic bound of (11) holds for the probability that a single
memory component is not retrieved from a random probe.
The theorem is now proved using the union bound as in the last
section. □


IV.  Conclusion 

The  results  of  this  correspondence  and  those  of  Komlos  and 
Paturi  [6]  imply  that  the  folk  theorem  on  robustness  is  well  founded 
in  situations  where  there  is  a  distributed  storage  of  information  in 
the  network.  In  such  instances  the  neural  network  would  appear  to 
be relatively resilient to the loss or damage of interconnection
weights. For the outer-product algorithm, in particular, each neuron
needs to retain only of the order of $\Omega(\log n)$ interconnection
weights  out  of  a  total  of  n  possible  links  with  other  neurons  for 
useful  associative  properties  to  emerge.  These  results  also  appear  to 
generalize  to  other,  more  complex  situations,  and  this  is  under 
investigation. 

In an evocative alternate line of thought we could consider
situations where the devil in the network is not malicious but is
actively well disposed towards producing useful sparse structures.
The issue here is whether we can exploit carefully designed sparsity
to design codes (families of allowed subsets of memories) which
have high storage capacities. Specifically, we would like to store
large numbers of memories (high capacity) where the allowed sets
of fundamental memories that can be picked are specified by a (large)
code. The intuition here is that large gains in storage capacity may
be obtained by excluding certain pathological sets of memories from
consideration in the code, and that such resulting codes may be
designed to fit suitable sparse architectures. We provide an illustration
of the gains that are possible in a succeeding paper [7].

Appendix  A 
Large Deviations

We  quote  the  following  technical  lemmas  without  proof.  Lemma 
A1  is  a  generalisation  of  the  classical  Chebyshev  inequality  and 
provides  a  large  deviation  estimate  in  terms  of  generating  functions. 
Lemma  A2  is  a  straightforward  generalisation  of  a  classical  large 
deviation  central  limit  theorem  for  sums  of  binary  random  variables 
which  provides  good  uniform  estimates  for  the  probability  that  the 
sum  has  a  large  deviation  from  the  mean.  (The  corresponding 
version  of  the  result  for  indicator  random  variables  (taking  values  0 
and  1  only)  can  be  found,  for  instance,  in  Feller's  text  [8].) 

Lemma A1: Let $X$ be a random variable and $x \ge 0$ any
nonnegative number. Then

$$P\{X \le -x\} \le \inf_{r \ge 0} e^{-rx} E(e^{-rX}).$$
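As a numerical illustration of Lemma A1 (not part of the original correspondence), the following Python sketch compares the bound $\inf_{r \ge 0} e^{-rx} E(e^{-rX})$ with the exact lower tail for a sum of independent ±1 variables. The function names, the grid search over $r$, and the parameter values are assumptions chosen purely for illustration.

```python
import math

def exact_lower_tail(N, q, x):
    """Exact P{X <= -x} for X a sum of N i.i.d. +/-1 terms with P(+1) = q."""
    total = 0.0
    for k in range(N + 1):              # k terms equal +1, so X = 2k - N
        if 2 * k - N <= -x:
            total += math.comb(N, k) * q**k * (1 - q)**(N - k)
    return total

def mgf_bound(N, q, x, r_max=5.0, steps=2000):
    """Grid approximation of inf over r >= 0 of exp(-r x) E exp(-r X)."""
    best = 1.0                          # r = 0 gives the trivial bound 1
    for s in range(1, steps + 1):
        r = r_max * s / steps
        mgf = (q * math.exp(-r) + (1 - q) * math.exp(r)) ** N
        best = min(best, math.exp(-r * x) * mgf)
    return best

exact = exact_lower_tail(50, 0.7, 0)
bound = mgf_bound(50, 0.7, 0)
print(f"exact tail = {exact:.3e}, Lemma A1 bound = {bound:.3e}")
```

Because the inequality holds for every fixed $r \ge 0$, a grid search can only overestimate the infimum, so the printed bound always lies above the exact tail.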

Lemma A2: Let $x_1 < x_2$ be any two real numbers and let $\{\xi_j\}$
be a sequence of i.i.d. random variables drawn from a sequence of
Bernoulli trials with

$$\xi_j = \begin{cases} x_1, & \text{with probability } q = 1 - p, \\ x_2, & \text{with probability } p, \end{cases}$$

where $0 < p < 1$. For each $K$ let $S_K = \sum_{j=1}^{K} \xi_j$. If as $K \to \infty$ the
real number $v$ varies such that $v/\sqrt{K} \to \infty$ and

if  p  *  q, 

"  (o(/f’/‘‘).  ifp  =  q=l/2. 


Balanced  Codes  and  Nonequiprobable  Signaling 
A.  R.  Calderbank,  Member,  IEEE,  and  M  Klimesh 

Abstract— The  problem  of  shaping  signal  constellations  that  are  de¬ 
signed  for  the  Gaussian  channel  is  considered.  The  signal  constellation 
consists  of  all  points  from  some  translate  of  a  lattice  A ,  that  lie  within  a 
region  The  signal  constellation  is  partitioned  into  T  annular  subcon- 
stellations  0,,  ■  ■  ■  ,0|._,  by  scaling  the  region  J> .  Signal  points  in  the 
same  subconstellation  are  used  equiprobably,  and  a  shaping  code  selects 
region  0^  with  frequency  J,.  If  the  signal  constellation  is  partitioned 
into annular subconstellations of unequal size, then absent some cleverness,
the transmission rate will vary with the choice of codeword in the
shaping code, and it will be necessary to queue the data in buffers. It is
described how balanced binary codes constructed by Knuth can be used
to avoid a data rate that is probabilistic. The basic idea is that if
symbols 0 and 1 represent constellations of unequal size, and if all
shaping codewords have equally many 0's and 1's, then the data rate
will be deterministic.

Index Terms: Bandwidth efficient communication, shaping codes,
nonequiprobable signaling.

I.  Introduction 

We  start  with  a  basic  region  3?  in  and  by  scaling  we  obtain 
a  nested  sequence  ;#  =  otQ  ,  a,  ,  •  •  • ,  a  j-.  ,  J?  of  copies  of  df . 
Let  n  be  the  signal  constellation  comprising  all  points  from  (some 
fixed  translate  oO  a  lattice  A  that  lie  within  the  region  . 

Then  Qq  =  A  fi  j?  and  0,  =  A  O  (a,  ^  \  a,_ ,  ),  /  =  1,  •  •  • ,  7" 
-  1,  give  a  partition  of  0  into  annular  subconstellations  with 
increasing  average  power. 

The  reason  we  consider  signal  constellations  drawn  from  lattices 
is  that  signal  points  are  distributed  regularly  throughout  A^-dimen- 
sional  space.  If  signals  are  equiprobable,  then  the  average  signal 
power  Po  of  the  constellation  Qo  is  approximately  the  average 
power  P(  ^ )  of  a  probability  distribution  that  is  uniform  within  ^ 
and  zero  elsewhere;  thus 


then 


where 


^{l^x  -  liiPXj  +  <7X,)|  >  u(xj  -  X,)} 


•/tv 


is  the  volume  of  the  region  ^ .  We  rewrite  (1)  as 
Po- (2)


where 


C(^) 


IAA\^dv 


(3) 


is  the  normalized  or  dimensionless  second  moment.  Since  G(  Jt)  is 
dimensionless,  it  is  not  changed  by  scaling  the  region  It 
measures  the  effect  of  the  shape  of  the  region  ^  on  average  signal 
power. 
Abstract

Neural associative memories viewed as a coding system have been subject to the criticism that the
codes have very low rates. In a fully-interconnected network of n neurons, for instance, the outer-product
algorithm has a storage capacity of the order of n/log n memories: specifically, almost all choices of
n/4 log n memories are stored as fixed points by the outer-product algorithm when the interconnectivity
graph has degree n. In this communication it is shown that storage capacities, C (the maximum number
of memories that can be stored), can be improved substantially at the expense of the size of the code
(the family of admissible C-sets of memories). In particular, a relation between code size, capacity, and
degree of the interconnectivity graph is shown: for any r < 1, a constant c > 0 and a code of size
[...] can be found such that it is possible to achieve memory storage capacities which are $\Omega(2^{n^r})$
in recurrent neural networks with interconnectivity graph of degree [...]. Thus, near-exponential
capacities can be obtained for codes which are still exponential in size. An interesting and useful side
effect of the constructions employed in this paper is that large capacities can be obtained in very sparsely
interconnected structures for suitably chosen codes.


1  INTRODUCTION 

A folk theorem in neural computation asserts that dense interconnectivity is a sine qua non for efficient usage
of resources, and, in particular, that sparser structures exhibit a degradation in computational capability.
This communication represents an effort to formally examine this tenet in the context of neural associative
memory and a recurrent neural network structure. An appreciation of the results may be best produced in
terms of a coding theoretic analogue. The memories to be stored can be thought of as codewords with the
neural network being the decoder which corrects errors in memories. Any set of memories to be stored is
chosen from a collection of admissible sets of memories which forms the code. In applications hitherto the
code has typically been the set of all subsets of binary vectors, i.e., the power set of $\{-1, 1\}^n$.* An algorithm


*The support of research grants from E. I. DuPont de Nemours, Inc. and the Air Force Office of Scientific Research (AFOSR
89-0523) is gratefully acknowledged.

'There have been some explorations, however, of sparse encodings, which have particular significance when the vector
representation of the codewords (or memories) is in terms of 1's and 0's instead of 1's and -1's. In such cases the codewords
(or memories) are typically chosen such that the number of components taking value 1 is small compared to the number of
components taking value 0. In terms of electrical circuit realisations of such networks, each memory can be represented by
relatively few active electrical lines. The code corresponding to such a sparse encoding is clearly much more dilute.


Biswas,  Venkatesh 


2 


for neural associative memory is a procedure which, given an admissible set of memories from a given code,
produces a network which stores the memories (typically as fixed points). For a given code the two main
parameters characterising the efficacy of an algorithm are the capacity (the largest admissible set of memories
that can be stored by the algorithm) and the attraction radius (the number of component errors in any
memory that can be corrected by the network). With a code consisting of all subsets of binary n-vectors, the
outer-product algorithm, for instance, has a capacity of the order of n/log n memories correcting a linear
number of random component errors [1,2].

In the general case the network may not be fully-interconnected, but may have connections specified by
an interconnectivity graph which may be rather sparse. If we still insist that almost all choices of memories
within capacity be stored (i.e., the code consists of all subsets of $\{-1, 1\}^n$) then the capacity of an algorithm
inevitably decreases as the degree of sparsity increases.† In an evocative alternate line of thought, however,
we might consider designing codes which are proper subsets of $\{-1, 1\}^n$ to take advantage of a given sparse
interconnectivity structure. This is the situation examined in this paper. The main result that emerges from
our investigations is the following: for any r < 1 we can achieve memory storage capacities which are $\Omega(2^{n^r})$
in recurrent neural networks with interconnectivity graph of degree [...] for carefully chosen codes of
log-size [...]. (In other words, we can find an exponential number of choices of $2^{n^r}$ memories that can
be stored as attractors in suitably chosen sparse networks.) The trade-off here is in increased capacity at
the expense of code size. Interestingly, increased sparsity in the interconnectivity graph can increase storage
capacity, for code choices which, as it will turn out, are exponential in size as long as the interconnectivity
graph has degree O(log n). The design methodology here lies in the choice of the code (subsets of admissible
memories) as a function of the network sparsity, and the selection of an algorithm to specify the strengths of
the interconnections for the given interconnectivity graph as a function of the memories to be stored. Note
that unlike the situation in the fully-interconnected case (the interconnectivity graph having degree n), not all
possible choices of $m = \Omega(2^{n^r})$ memories can be stored in the sparse network. Rather, the memories to
be stored must be taken from the set of admissible memories which form the code.

In  the  next  section  we  briefly  review  the  neural  model  and  formally  introduce  the  notions  of  codes  and 
capacity  in  the  context  of  neural  associative  memory.  In  section  3  we  introduce  a  simple  model  of  block 
sparsity  where  the  neurons  are  partitioned  into  mutually  non-communicating  sets.  For  this  structure  we 
demonstrate  a  code  which  is  sufficiently  rich  while  yielding  large  capacities  for  the  classical  outer-product 
algorithm.  In  section  4  we  show  how  the  results  for  the  simple  block  sparsity  model  extend  to  other  sparse 
structures,  and  in  particular,  the  nested  model  introduced  by  Baram  [5].  We  present  a  generalisation  of 
a  spectral  based  algorithm  for  the  block  sparsity  model  in  section  5,  and  show  concomitant  increases  in 
capacity.  Theorems  are  stated  in  the  body  of  the  paper  while  their  proofs  and  relevant  technical  lemmas  are 
developed  in  sequence  in  the  appendices. 

Notation We employ usual asymptotic notation, and introduce two (non-standard) notations: if $\{x_n\}$ and
$\{y_n\}$ are positive sequences, we say that

• $x_n = \Omega(y_n)$ if there exists $K$ such that $x_n/y_n \ge K$ for all $n$;

• $x_n = O(y_n)$ if there exists $L$ such that $x_n/y_n \le L$ for all $n$;

• $x_n = \Theta(y_n)$ if $x_n = \Omega(y_n)$ and $x_n = O(y_n)$;

• $x_n \sim y_n$ if $x_n/y_n \to 1$ as $n \to \infty$; we also say that $x_n \gtrsim y_n$ if $x_n \ge y_n$ for $n$ large enough, and
$x_n \lesssim y_n$ if $x_n \le y_n$ for $n$ large enough;

• $x_n = o(y_n)$ if $x_n/y_n \to 0$ as $n \to \infty$.

We denote by $\mathbb{B}$ the set $\{-1, 1\}$, and by $[n]$ the set of integers $\{1, 2, \ldots, n\}$. By ordered multiset we mean
an ordered collection of elements with repetition of elements allowed. We will use the terminology m-set and
ordered multiset of size m interchangeably. All logarithms in the exposition are to base e.

†The decrease in capacity with increased sparsity is not catastrophic, however, and network performance as an associative
memory degrades gracefully [3, 4].
2 ASSOCIATIVE MEMORY

2.1  Recurrent  Neural  Networks 

A neuron (after McCulloch and Pitts [6]) is formally a linear threshold element characterised by $n$ real
weights, $w \in \mathbb{R}^n$, which, in response to a vector of $n$ (real or binary) inputs, $u$, produces a binary output,
$v \in \mathbb{B}$, as the sign of the weighted sum

$$v = \operatorname{sgn} \langle w, u \rangle = \operatorname{sgn} \sum_{j=1}^{n} w_j u_j.$$

We consider a network of $n$ formal neurons. The state of the network at any epoch is the n-vector,
$u \in \mathbb{B}^n$, of neural outputs at that epoch. Neural outputs at each epoch are fed back and constitute the
inputs to each neuron at the next update epoch. The allowed pattern of neural interconnectivity is specified
by the edges of a (bipartite) interconnectivity graph, $G_n$, on vertices, $[n] \times [n]$. In particular, the existence of
an edge $\{i, j\}$ in $G_n$ indicates that the output of the $j$-th neuron is fed back as input to the $i$-th neuron.
The network is characterised by an $n \times n$ matrix of weights, $W = [w_{ij}]$, where $w_{ij}$ denotes the (real) weight
linking the output of neuron $j$ to the input of neuron $i$. (We adopt the convention that a weight, $w_{ij}$, is
zero if $\{i, j\} \notin G_n$.) If $u \in \mathbb{B}^n$ is the current state of the system, an update, $u_i'$, of the state of the $i$-th
neuron is specified by the linear threshold rule

$$u_i' = \operatorname{sgn}\left(\sum_{j=1}^{n} w_{ij} u_j\right).$$

The  two  extreme  modes  of  neural  updates  are  synchronous,  with  every  neuron  being  updated  in  concert,  and 
asynchronous,  with  at  most  one  neuron  being  updated  at  any  instant.  Mixed  modes  of  operation  between  the 
two  extremes  are,  of  course,  feasible.  For  any  mode  of  operation  the  network  dynamics  describe  trajectories 
in  a  state  space  comprised  of  the  vertices  of  the  n-cube. 
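The two update modes can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names and the small weight matrix are hypothetical, and the convention sgn 0 = 1 is the one adopted in the text.

```python
def sgn(x):
    """Sign function with the convention sgn(0) = 1."""
    return 1 if x >= 0 else -1

def synchronous_step(W, u):
    """Update every neuron in concert from the same current state u."""
    return [sgn(sum(w_ij * u_j for w_ij, u_j in zip(row, u))) for row in W]

def asynchronous_step(W, u, i):
    """Update neuron i only; every other component is left unchanged."""
    v = list(u)
    v[i] = sgn(sum(w_ij * u_j for w_ij, u_j in zip(W[i], u)))
    return v

# Hypothetical 3-neuron network and state.
W = [[0, 1, -1],
     [1, 0, 1],
     [-1, 1, 0]]
u = [1, -1, 1]
print(synchronous_step(W, u))
print(asynchronous_step(W, u, 0))
```

Mixed modes of operation would update some subset of the neurons at each epoch; the two functions above are the extreme cases.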

The utility of this network model as an associative memory hinges upon the observation that under
suitable symmetry conditions there are Lyapunov functions for the system [7, 8]. In particular, for each
state $u \in \mathbb{B}^n$ define the energy function, $E(u)$, as the quadratic form

$$E(u) = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} u_i u_j = -\frac{1}{2} \langle u, Wu \rangle.$$

If $W$ is symmetric, non-negative definite, then the function $E$ is non-increasing along any trajectory in any
mode of operation [9].
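The monotonicity of the energy can be checked numerically. The sketch below is an illustration under stated assumptions: the energy is taken as the quadratic form with the customary factor 1/2 (the factor does not affect monotonicity), the weights are built as a sum of outer products so that W is symmetric and non-negative definite, and the names, sizes, and random seed are hypothetical. Only the asynchronous mode is exercised here.

```python
import random

def sgn(x):
    return 1 if x >= 0 else -1

def energy(W, u):
    """Quadratic-form energy -(1/2) <u, W u> (the 1/2 is an assumed normalisation)."""
    n = len(u)
    return -0.5 * sum(W[i][j] * u[i] * u[j] for i in range(n) for j in range(n))

def outer_product_weights(memories):
    """Sum of outer products: symmetric and non-negative definite by construction."""
    n = len(memories[0])
    return [[sum(m[i] * m[j] for m in memories) for j in range(n)] for i in range(n)]

rng = random.Random(0)
n = 12
memories = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(3)]
W = outer_product_weights(memories)

u = [rng.choice((-1, 1)) for _ in range(n)]
energies = [energy(W, u)]
for _ in range(200):                     # asynchronous mode: one neuron per step
    i = rng.randrange(n)
    u[i] = sgn(sum(W[i][j] * u[j] for j in range(n)))
    energies.append(energy(W, u))

monotone = all(later <= earlier for earlier, later in zip(energies, energies[1:]))
print("energy non-increasing:", monotone)
```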

We  can,  hence,  think  in  terms  of  an  “energy  landscape”  with  states  embedded  in  it.  Trajectories  in 
this  landscape  tend  to  go  “downhill.”  States  which  form  local  “energy”  minima,  hence,  determine  system 
dynamics; each such state possesses a basin of attraction comprised of neighbouring states of higher “energy”
which  are  mapped  into  the  state  at  the  local  minimum.  This  geometric  picture  is  particularly  persuasive 
for  an  associative  memory  application  where  we  wish  to  store  a  desired  set  of  states — the  memories — as 
fixed  points  of  the  network,  and  with  the  property  that  errors  in  an  input  representation  of  a  memory  are 
corrected  and  the  memory  retrieved.  The  challenge  here  is  to  choose  a  matrix  of  weights  such  that  the 
desired  memories  are  located  at  energy  minima. 

Let $u \in \mathbb{B}^n$ be a memory and $0 \le p < 1$ a parameter. Corresponding to the memory $u$ we generate a
probe $\hat{u} \in \mathbb{B}^n$ by independently specifying the components, $\hat{u}_j$, of the probe as follows:

$$\hat{u}_j = \begin{cases} u_j & \text{with probability } 1 - p \\ -u_j & \text{with probability } p. \end{cases} \qquad (1)$$

We call $\hat{u}$ a random probe with parameter p.
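A random probe is straightforward to simulate. The following sketch assumes a hypothetical helper name and an all-ones memory chosen only for convenience; it flips each component independently with probability p and reports the observed fraction of errors.

```python
import random

def random_probe(u, p, rng):
    """Flip each component of the memory u independently with probability p."""
    return [-u_j if rng.random() < p else u_j for u_j in u]

rng = random.Random(1)
u = [1] * 10000                      # hypothetical memory
probe = random_probe(u, 0.2, rng)
error_fraction = sum(a != b for a, b in zip(u, probe)) / len(u)
print("fraction of components in error:", error_fraction)
```

For large n the observed fraction of errors concentrates near p, which is why p can be read as a fractional radius.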

Definition 2.1 We say that a memory, $u$, is a monotone p-attractor if, with probability approaching one
as $n \to \infty$, the network corrects all errors in a random probe with parameter p in one synchronous step. We
call p the (fractional) attraction radius. We also say that $u$ is stable if it is a monotone 0-attractor.

‡The model allows for a real threshold as well, but this will not be important to our discussion. We will throughout assume
a zero threshold for each neuron. We also adopt the convention sgn 0 = 1.
Remarks: Note that stable memories are just fixed points of the network. Also, by the Borel strong law,
the fraction of the number of components in the probe which are in error (i.e., not equal to the corresponding
components of the memory) is concentrated at the expected value p.*

For $m \ge 1$ let $u^1, \ldots, u^m \in \mathbb{B}^n$ be an m-set of memories to be stored. The outer-product algorithm
specifies the interconnection weights, $w_{ij}$, according to the following rule: for $i \in [n]$, $\{i, j\} \in G_n$,

$$w_{ij} = \sum_{\beta=1}^{m} u_i^\beta u_j^\beta. \qquad (2)$$

In the fully-interconnected situation, for instance, $W$ is symmetric, non-negative definite so that suitable
associative properties result. In general, if the interconnectivity graph, $G_n$, is symmetric then, under a
suitable mode of operation, there is a Lyapunov function for the network specified by the outer-product
algorithm.
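The outer-product prescription, with weights set to zero off the interconnectivity graph, might be sketched as follows. The representation of the graph as a set of index pairs, the 0-based indexing, and the function names are assumptions made for this illustration.

```python
def sgn(x):
    return 1 if x >= 0 else -1

def outer_product_weights(memories, edges=None):
    """w_ij = sum over memories of u_i * u_j on edges of the graph, 0 elsewhere.

    edges is a set of (i, j) index pairs; None stands for full interconnection.
    """
    n = len(memories[0])
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if edges is None or (i, j) in edges:
                W[i][j] = sum(u[i] * u[j] for u in memories)
    return W

def is_fixed_point(W, u):
    """A memory is stable when one synchronous step leaves it unchanged."""
    n = len(u)
    return all(sgn(sum(W[i][j] * u[j] for j in range(n))) == u[i] for i in range(n))

u = [1, -1, 1, 1]                               # hypothetical single memory
W_full = outer_product_weights([u])
W_sparse = outer_product_weights([u], edges={(0, 1), (1, 0)})
print(is_fixed_point(W_full, u), W_sparse[0][1], W_sparse[2][3])
```

With a single stored memory and full interconnection, W u = n u, so the memory is trivially a fixed point; the interesting regime is many memories, where the cross-terms act as noise.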

2.2  Codes  and  Capacity 

For given integers $m \ge 1$, $n \ge 1$, a code, $\mathcal{K}_n^m$, is a collection of ordered multisets of size $m$ from $\mathbb{B}^n$. We
say that an m-set of memories is admissible iff it is in $\mathcal{K}_n^m$. Thus, a code just specifies which m-sets are
allowable as memories. Henceforth when we refer to a memory we mean a binary n-tuple in some admissible
set from a code $\mathcal{K}_n^m$. Examples of codes include: the set of all ordered multisets of size $m$ from $\mathbb{B}^n$; a single
multiset of size $m$ from $\mathbb{B}^n$; all collections of $m$ mutually orthogonal vectors in $\mathbb{B}^n$; all m-sets of vectors in
$\mathbb{B}^n$ in general position.

Clearly, if all memories in an admissible m-set of memories are stable (or are monotone p-attractors),
then so are the $m!$ ordered multisets generated by all permutations of the original m-set. (For $m$ linear in $n$,
for instance, the number of permutations is of the order of $2^{cn \log n}$ for some constant $c$.) We hence need to
guard against defining trivial codes generated by permutations of a few basic ordered multisets of memories.
Define two ordered multisets of memories to be equivalent if they are permutations of one another. We define
the size of a code, $|\mathcal{K}_n^m|$, to be the number of distinct equivalence classes of m-sets of memories. We will be
interested in codes of relatively large size: $\log |\mathcal{K}_n^m| / n \to \infty$ as $n \to \infty$. In particular, we require at least an
exponential number of choices of (equivalence classes of) admissible m-sets of memories. For a given code,
$\mathcal{K}_n^m$, we confer a probability distribution on memories by choosing an m-set of memories from the uniform
distribution on $\mathcal{K}_n^m$.

For each fixed $n$ and interconnectivity graph, $G_n$, an algorithm, $\mathcal{X}$, is a prescription which, given an
m-set of memories, produces a corresponding set of interconnection weights, $w_{ij}$, $i \in [n]$, $\{i, j\} \in G_n$. Let
$\{\mathcal{K}_n^m\}$, $m \ge 1$, $n \ge 1$ be a doubly-indexed sequence of codes, and let $\mathcal{X}$ be an algorithm (corresponding to
an underlying interconnectivity graph, $G_n$). For $m \ge 1$ let $\mathcal{A}(u^1, \ldots, u^m)$ be some attribute of m-sets of
memories. (The following, for instance, are examples of attributes of admissible sets of memories: all the
memories are stable in the network generated by $\mathcal{X}$; almost all the memories are monotone p-attractors.)
For given $n$ and $m$, we choose a random m-set of memories, $u^1, \ldots, u^m$, from the uniform distribution on
$\mathcal{K}_n^m$.

Definition 2.2 A sequence, $\{C_n\}$, is a capacity function for the attribute $\mathcal{A}$ (or $\mathcal{A}$-capacity for short) if
for $\lambda > 0$ arbitrarily small:


§An alternative, and perhaps more appealing, model for generating random probes is to choose the probe at random from
the Hamming ball of radius pn at u. The notion of a radius of attraction is intuitively and geometrically much clearer for this
model. However, by the sphere hardening effect, almost all probes generated in this model are concentrated at the surface of
the Hamming ball surrounding the memory, so that the number of errors is again essentially pn. The analytical capacity results
that derive for this model are formally indistinguishable from the model we have adopted in equation (1), though the technical
details are somewhat different. The present format is, however, slightly more convenient mathematically.

¶We define admissible m-sets of memories in terms of ordered multisets rather than sets so as to obviate certain technical
nuisances.
a) $P\{\mathcal{A}(u^1, \ldots, u^m)\} \to 1$ as $n \to \infty$ whenever $m \le (1 - \lambda)C_n$;

b) $P\{\mathcal{A}(u^1, \ldots, u^m)\} \to 0$ as $n \to \infty$ whenever $m \ge (1 + \lambda)C_n$.

We also say that $C_n$ is a lower $\mathcal{A}$-capacity if property (a) holds, and that $C_n$ is an upper $\mathcal{A}$-capacity if
property (b) holds.

Remark: The capacity function implicitly depends upon the sequence of interconnectivity graphs, $G_n$,
the sequence of codes, $\mathcal{K}_n^m$, and the algorithm, $\mathcal{X}$, as well as on the desired attribute, $\mathcal{A}$, of the memories.


3  BLOCK  SPARSITY 

Sparse  interconnectivity  graphs  are  of  importance  in  practical  realisations.  Besides  the  obvious  advantages 
in programming when there are relatively few weights, cost considerations strongly favour sparse interconnectivity
as interconnections dominate the silicon real estate in hardware realisations of these networks.
One  of  the  simplest  forms  of  sparsity  we  might  enjoin  is  block  sparsity  where  the  neurons  are  partitioned 
into  disjoint  subsets  of  neurons  with  full-interconnectivity  within  each  subset  and  no  neural  interconnections 
between  subsets.  The  weight  matrix  in  this  case  takes  on  a  block  diagonal  form,  and  the  interconnectivity 
graph  is  composed  of  a  set  of  disjoint  complete  bipartite  sub-graphs. 

More formally, let $1 \le b \le n$ be a positive integer, and let $\{I_1, \ldots, I_{n/b}\}$ partition $[n]$ such that each
subset of indices, $I_k$, $k = 1, \ldots, n/b$, has size $|I_k| = b$.# We call each $I_k$ a block and $b$ the block size. We
specify the edges of the (bipartite) block interconnectivity graph $BG_n$ by $\{i, j\} \in BG_n$ iff $i$ and $j$ lie in a
common block. For any given m-set of memories, $u^1, \ldots, u^m$, we specify the interconnection weights, $w_{ij}$,
$i \in [n]$, $\{i, j\} \in BG_n$, by the outer-product algorithm of prescription (2).

Proposition 3.1 With interconnectivities specified by the block interconnectivity graph, $BG_n$, and weights
by the outer-product algorithm, the energy function, $E$, is non-increasing along any trajectory in any mode
of operation.

Proof: Let $W_k$ be the sub-matrix of weights corresponding to the components of block $I_k$. Note that
$W_k$ is symmetric, non-negative definite for each $k = 1, \ldots, n/b$. Now, for any vector $u \in \mathbb{B}^n$, let $u_k \in \mathbb{B}^b$
denote the binary b-tuple of components of $u$ in block $I_k$. We can then write the energy function as

$$E(u) = -\frac{1}{2} \langle u, Wu \rangle = -\frac{1}{2} \sum_{k=1}^{n/b} \langle u_k, W_k u_k \rangle = \sum_{k=1}^{n/b} E_k(u_k),$$

where, for each $k$, $E_k$ denotes the energy function for block $I_k$. As the blocks are disjoint, two distinct vectors
$u_k$ and $u_l$ do not share any components. Consequently, each $E_k$ is non-increasing along any trajectory in
$\mathbb{B}^n$, and thus, so is $E$. ∎
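The decomposition of the energy over blocks used in the proof can be verified directly on a small instance. In the sketch below the block graph, the two stored vectors, the test state, and all function names are hypothetical; blocks are taken as consecutive runs of b indices (0-based), and the factor 1/2 in the energy is an assumed normalisation.

```python
def block_edges(n, b):
    """Pairs (i, j) lying in a common block of b consecutive indices."""
    return {(i, j) for i in range(n) for j in range(n) if i // b == j // b}

def outer_product_weights(memories, edges):
    n = len(memories[0])
    return [[sum(u[i] * u[j] for u in memories) if (i, j) in edges else 0
             for j in range(n)] for i in range(n)]

def energy(W, u):
    n = len(u)
    return -0.5 * sum(W[i][j] * u[i] * u[j] for i in range(n) for j in range(n))

n, b = 6, 2
memories = [[1, -1, 1, 1, -1, -1], [1, 1, -1, 1, -1, 1]]
W = outer_product_weights(memories, block_edges(n, b))   # block diagonal

u = [1, 1, -1, -1, 1, -1]
total = energy(W, u)
per_block = []
for k in range(n // b):
    idx = range(k * b, (k + 1) * b)
    W_k = [[W[i][j] for j in idx] for i in idx]           # sub-matrix for block k
    u_k = [u[i] for i in idx]                             # sub-vector for block k
    per_block.append(energy(W_k, u_k))
print(total, per_block)
```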

Let $C\mathcal{K}_n^m$ denote the complete code of all choices of ordered multisets of size $m$ from $\mathbb{B}^n$.

Theorem 3.2 Let the block size $b$ be such that $b = \Omega(\log n)$ as $n \to \infty$, and let $0 \le p < 1/2$ be a fixed
attraction radius. Then, for block interconnectivity graphs $BG_n$, complete codes $C\mathcal{K}_n^m$, and the outer-product
algorithm, the monotone p-attractor capacity is $(1 - 2p)^2 b / 2 \log bn$.

Corollary 3.3 Under the conditions of theorem 3.2 the fixed point memory capacity is $b / 2 \log bn$.


#Here, as in the rest of the paper, we ignore details with regard to rounding to the nearest integer in an effort to simplify
notation. The modifications to be made for formal correctness will be obvious, and do not affect the results.
Corollary 3.4 For a fully-interconnected graph, complete codes $C\mathcal{K}_n^m$, and the outer-product algorithm, the
fixed point memory capacity is $n / 4 \log n$.
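The capacity expressions can be evaluated numerically to see how they fit together: with block size b = n the expression b / 2 log bn becomes n / 2 log n² = n / 4 log n, the fully-interconnected value. The short sketch below uses a hypothetical function name and illustrative values of n, b and p; natural logarithms are used, as in the text.

```python
import math

def block_attractor_capacity(n, b, p=0.0):
    """Evaluate (1 - 2p)^2 * b / (2 log(b n)); p = 0 gives the fixed point case."""
    return (1 - 2 * p) ** 2 * b / (2 * math.log(b * n))

n = 4096
full = block_attractor_capacity(n, n)           # b = n: fully interconnected
blocked = block_attractor_capacity(n, 64)       # sparse: blocks of 64 neurons
with_radius = block_attractor_capacity(n, 64, p=0.1)
print(full, n / (4 * math.log(n)), blocked, with_radius)
```

These are asymptotic expressions, so the numbers for finite n are indicative only.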


Corollary 3.4 is the main result shown by McEliece, et al. [1]. Theorem 3.2 is a slight extension of the
result  and  shows  the  natural  result  that  increased  sparsity  causes  a  loss  in  capacity  if  the  code  is  complete, 
i.e.,  all  choices  of  memories  are  considered  admissible.  It  is  possible,  however,  to  design  codes  to  take 
advantage  of  the  sparse  interconnectivity  structure  as  the  following  simple  construction  indicates. 

Without loss of generality let us assume that block $I_1$ consists of the first $b$ indices, $[b]$, block $I_2$ the
next $b$ indices, $[2b] - [b]$, and so on, with the last block consisting of the last $b$ indices, $[n] - [n - b]$. We
can then partition any vector $u \in \mathbb{B}^n$ as

$$u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n/b} \end{pmatrix}, \qquad (3)$$

where for $k = 1, \ldots, n/b$, $u_k$ is the sub-vector of components corresponding to block $I_k$. For $M > 1$ we form
the block code $BK_n^M$ as follows: to each ordered multiset of $M$ vectors, $u^1, \ldots, u^M$ from $\mathbb{B}^n$, we associate
a unique ordered multiset in $BK_n^M$ by lexicographically ordering all $M^{n/b}$ vectors of the form

$$\begin{pmatrix} u_1^{\alpha_1} \\ u_2^{\alpha_2} \\ \vdots \\ u_{n/b}^{\alpha_{n/b}} \end{pmatrix}, \qquad \alpha_1, \alpha_2, \ldots, \alpha_{n/b} \in [M].$$

Thus, we obtain an admissible set of $M^{n/b}$ memories from any ordered multiset of $M$ vectors in $\mathbb{B}^n$ by
"mixing" the blocks of the vectors. We call each $M$-set of vectors, $u^1, \ldots, u^M \in \mathbb{B}^n$, the generating vectors
for the corresponding admissible set of memories in $BK_n^M$.


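Example: Consider a case with $n = 4$, block size $b = 2$, and $M = 2$ generating vectors. To any 2-set
of generating vectors there corresponds a unique $4\,(= M^{n/b})$-set in the block code as follows:

$$\left\{ \begin{pmatrix} u_1^1 \\ u_2^1 \end{pmatrix}, \begin{pmatrix} u_1^2 \\ u_2^2 \end{pmatrix} \right\} \longmapsto \left\{ \begin{pmatrix} u_1^1 \\ u_2^1 \end{pmatrix}, \begin{pmatrix} u_1^1 \\ u_2^2 \end{pmatrix}, \begin{pmatrix} u_1^2 \\ u_2^1 \end{pmatrix}, \begin{pmatrix} u_1^2 \\ u_2^2 \end{pmatrix} \right\}.$$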
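
The mixing construction can be sketched in a few lines. This is an illustrative sketch only: the function name `block_code` and the convention that blocks are runs of $b$ consecutive indices are assumptions made for the example, and $b$ is assumed to divide $n$.

```python
from itertools import product

def block_code(generators, b):
    """Build the M^(n/b) memories obtained by mixing the blocks of M generating vectors.

    generators: list of M vectors (sequences of +1/-1) of common length n.
    Memories are returned in lexicographic order of the index tuple
    (alpha_1, ..., alpha_{n/b}), each alpha_k ranging over the generators.
    """
    n = len(generators[0])
    M = len(generators)
    num_blocks = n // b  # assumes b divides n
    memories = []
    # product(...) enumerates index tuples in lexicographic order.
    for alphas in product(range(M), repeat=num_blocks):
        vec = []
        for k, a in enumerate(alphas):
            # Take block k from generating vector number a.
            vec.extend(generators[a][k * b:(k + 1) * b])
        memories.append(tuple(vec))
    return memories

# The example above: n = 4, b = 2, M = 2.
gens = [(1, 1, -1, 1), (-1, 1, 1, -1)]
mems = block_code(gens, b=2)
```

With these two generating vectors, `mems` contains four vectors: the two generators themselves, plus the two vectors obtained by swapping their second blocks.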

The basic idea behind the formation of the block code is that if each sub-vector $u_k$ is stable with respect
to the fully-interconnected submatrix of weights $W_k$ (i.e., the energy $E_k(u_k)$ is a local minimum) then the
vector $u$ is stable with respect to the matrix of weights $W$ (i.e., the energy $E(u)$ is also a local minimum) for
the block interconnectivity graph $BG_n$. Thus, we can mix any combination of stable vectors $u_k$ to obtain
a stable vector $u$. Consequently, if we choose $M$ small enough that for most choices of $M$ vectors, $u^1, \ldots, u^M$
in $\mathbb{B}^n$, each of the vectors $u_k^\alpha$, $\alpha = 1, \ldots, M$ is stable for each of the blocks $k = 1, \ldots, n/b$, then we can
generate a relatively large number of stable vectors ($M^{n/b}$ in number) by mixing the blocks. We will take
care of technical details in the appendix: specifically, we need stability of $M$ vectors for a large number of
$n/b$ blocks simultaneously; further, to estimate capacity when there is error-correction we will have to guard
against the possibility that $\rho n$ errors in a memory translate into a disproportionate share of errors in one
or more blocks.

Theorem 3.5 Let $0 < \rho < 1/2$ be a fixed attraction radius. Then we have the following capacity estimates
for block interconnectivity graphs $BG_n$, block codes $BK_n^M$, and the outer-product algorithm:

a) If the block size $b$ satisfies $n \log\log bn / b \log bn \to 0$ as $n \to \infty$ then the monotone $\rho$-attractor capacity is

$$\left[ \frac{(1 - 2\rho)^2 b}{2 \log bn} \right]^{n/b}.$$

b)  Define  for  any  v 

If the block size $b$ satisfies $b/\log n \to \infty$ and $b \log bn / \log\log bn = O(n)$ as $n \to \infty$, then $C_n(\nu)$ is a lower
monotone $\rho$-attractor capacity for any choice of $\nu < 3/2$ and $C_n(\nu)$ is an upper monotone $\rho$-attractor
capacity for any $\nu > 3/2$.

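Corollary 3.6 If, for fixed $t > 1$, we have $b = n/t$, then for any fixed attraction radius $0 < \rho < 1/2$, graphs
$BG_n$, codes $BK_n^M$, and the outer-product algorithm, the monotone $\rho$-attractor capacity is $\left[ (1 - 2\rho)^2 n / 2t \log(n^2/t) \right]^t$, which is the estimate of theorem 3.5(a) with $b = n/t$.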

Corollary  3.7  For  any  fixed  attraction  radius  0  <  p  <  1/2,  and  for  any  r  <  1,  a  constant  c  >  0  and  a 
code  of  size  ft  ^2****~^j  con  be  found  such  that  it  is  possible  to  achieve  lower  monotone  p-attractor  capacities 
which  are  (I  {2"')  in  recurrent  neural  networks  with  interconnectivity  graph  of  degree  0  (n'"’’). 

Remarks: If the number of blocks is kept fixed as $n$ grows (i.e., the block size grows linearly with $n$) then
capacities polynomial in $n$ are attained. If the number of blocks increases with $n$ (i.e., the block size grows
sub-linearly with $n$) then super-polynomial capacities are attained.
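
To get a feel for these remarks, the capacity expressions of theorems 3.2 and 3.5(a) can be evaluated at finite sizes. This is only an illustrative sketch: the results are asymptotic, so finite-$n$ values are indicative at best; the natural logarithm is assumed (the text writes only "log"), and the function names are hypothetical.

```python
import math

def complete_code_capacity(n, b, rho=0.0):
    """(1 - 2*rho)^2 * b / (2 * log(b*n)): the expression of theorem 3.2."""
    return (1 - 2 * rho) ** 2 * b / (2 * math.log(b * n))

def block_code_log2_capacity(n, b, rho=0.0):
    """log2 of [(1 - 2*rho)^2 * b / (2 * log(b*n))]^(n/b): theorem 3.5(a)."""
    return (n / b) * math.log2(complete_code_capacity(n, b, rho))

n = 4096
# log2 of the block-code capacity for 1, 4 and 16 blocks.
table = {b: block_code_log2_capacity(n, b) for b in (4096, 1024, 256)}
```

For a single block ($b = n$) the first expression reduces to $n / 4\log n$, matching corollary 3.4; splitting into more blocks increases the exponent $n/b$ faster than it shrinks the base over this range.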


4  NESTED  SPARSITY 

Recently, Baram [5] has proposed the investigation of certain nested codes geared towards exploiting certain
classes of sparsely interconnected neural networks. The basic model can be described in terms of a nesting of
block interconnectivity graphs: a hierarchy of blocks is defined with blocks at any given nesting level derived
recursively from blocks at the previous level. More precisely, let $b$ as before denote the block size, $1 < b < n$,
and let the positive integer $1 < h < \log n / \log b$ denote the nesting depth. For each nesting level $l = 1, \ldots, h$,
we recursively define a disjoint collection of blocks, $I_1^l, \ldots, I_{nb^{-l}}^l$, as follows.

Base: As in the block interconnectivity graph, at nesting level 1 the blocks $I_1^1, \ldots, I_{n/b}^1$ partition $[n]$,
with each block having size $b$.

Recursion: Let $I_1^l, \ldots, I_{nb^{-l}}^l$ be blocks corresponding to nesting level $l$. For $k = 1, \ldots, nb^{-l}$ let
$i_k^l \in I_k^l$ be a specification of indices. The blocks corresponding to nesting level $(l + 1)$ are now chosen so as
to partition the specified set of indices and such that each block has size $b$.

We specify the edges of the (bipartite) nested interconnectivity graph $NG_n$ by $\{i, j\} \in NG_n$ iff $i$ and
$j$ lie in a common block at any nesting level. The nested code we consider is just the block code
defined for the lowest nesting level, $l = 1$. Again, for any $m$-set of memories, $u^1, \ldots, u^m$, we specify the
interconnection weights, $w_{ij}$, $i \in [n]$, $\{i, j\} \in NG_n$, by the outer-product algorithm of prescription (2).

The nested interconnection graph structure is very similar to that of the block interconnection graph
with fully-interconnected disjoint subsets of neurons. For the nested structure, however, a small number of
interconnections are permitted between blocks. At the first nesting level the structure is that of the block
interconnectivity graph with $n/b$ disjoint blocks of fully-interconnected neurons. For the next nesting layer,
one neuron is specified from each of the $n/b$ blocks of the first layer and these are grouped into $n/b^2$ blocks of
fully-interconnected neurons. Thus, a specified neuron in each block in layer 1 is permitted connections with
neurons in an additional $b - 1$ blocks. This exercise is repeated recursively for each of the remaining layers.
We hence have essentially a block interconnectivity graph except that certain rare "long range interactions"
are allowed across blocks of fully-interconnected neurons. The intuitive idea behind this setup is to have a
sparsely interconnected network very reminiscent of a block structure in which a certain limited amount of
communication is allowed between blocks of distinct features.
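
The recursive construction can be sketched as follows. This is a minimal illustration under stated assumptions: the specified index of each block is taken to be its first element (the text leaves the choice free), blocks at each level are formed from consecutive runs of the specified indices, $b^h$ is assumed to divide $n$, and the function names are hypothetical.

```python
def nested_blocks(n, b, h):
    """Return, for nesting levels 1..h, the list of blocks (each a list of indices)."""
    levels = []
    indices = list(range(n))  # level 1 partitions all n indices
    for _ in range(h):
        blocks = [indices[s:s + b] for s in range(0, len(indices), b)]
        levels.append(blocks)
        # Assumption: the specified index of each block is its first element.
        indices = [blk[0] for blk in blocks]
    return levels

def nested_edges(n, b, h):
    """Edges {i, j} such that i and j share a block at some nesting level."""
    edges = set()
    for blocks in nested_blocks(n, b, h):
        for blk in blocks:
            for x in range(len(blk)):
                for y in range(x + 1, len(blk)):
                    edges.add((blk[x], blk[y]))
    return edges

levels = nested_blocks(8, 2, 2)
edges = nested_edges(8, 2, 2)
```

For $n = 8$, $b = 2$, $h = 2$ this gives four level-1 blocks and two level-2 blocks, so the graph has the four within-block edges plus two "long range" edges joining specified neurons.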

It  turns  out  that  the  ideas  developed  in  the  analysis  of  block  sparsity  can  be  applied  to  nested  structures 
as  well,  and  in  fact,  the  relatively  small  number  of  inter-block  interconnections  do  not  alter  capacity. 

Theorem 4.1 Consider nested interconnectivity graphs $NG_n$ with block size $b$ and nesting depth $h$, nested
codes, and the outer-product algorithm. Then, for any $0 < \rho < 1/2$, the monotone $\rho$-attractor capacity
estimates of theorem 3.5 continue to hold under the same conditions on block size $b$. In particular, the
capacity estimates are independent of the nesting depth.


5  SPECTRAL  ALGORITHM 

The spectral algorithm was proposed (cf. Venkatesh and Psaltis [9] and Personnaz, et al. [10]) as an alternative
to the outer-product algorithm in the context of neural associative memory. At the expense of some additional
algorithmic complexity the algorithm circumvents the need for orthogonality of memories in the
outer-product algorithm and improves the fixed-point storage capacity from the sub-linear capacities that obtain
(for fully-interconnected networks) to capacities linear in $n$ [9]. Concomitant increases in capacity may now
be obtained for block interconnectivity graphs and block codes by extending the spectral algorithm in a
manner analogous to the treatment earlier.

More specifically, consider a block interconnectivity graph $BG_n$ with block size $b$, and the block code
$BK_n^M$. (As before, assume the indices are assigned sequentially to the blocks $I_1, \ldots, I_{n/b}$.) Consider a
choice of an admissible $M^{n/b}$-set of memories from $BK_n^M$ corresponding to the $M$-set of generating vectors,
$u^1, \ldots, u^M \in \mathbb{B}^n$. Now consider the $k$-th block, $k \in [n/b]$. Define the $b \times M$ matrix of column vectors

$$U_k = \left[ u_k^1 \; u_k^2 \; \cdots \; u_k^M \right].$$

(Recall that $u_k^\alpha$ is the vector of components corresponding to block $I_k$ of the generating vector $u^\alpha$.) Let $\lambda_{k1}$,
$\ldots$, $\lambda_{kM}$ be fixed positive numbers and let $\Lambda_k = \mathrm{dg}(\lambda_{k1}, \ldots, \lambda_{kM})$. For each $k$ define the sub-matrix of
weights, $W_k$, corresponding to the components of block $I_k$ by

$$W_k = U_k \Lambda_k U_k^\dagger,$$

where $U_k^\dagger$ denotes the pseudo-inverse of $U_k$. (If $U_k$ is full-rank then $U_k^\dagger = (U_k^T U_k)^{-1} U_k^T$, where $U_k^T$ denotes
the transpose of $U_k$.) The above prescription generalises the spectral algorithm to block interconnectivity
graphs and block codes. [The case of a single block ($b = n$) yields the original algorithm.]
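
For a single block, the prescription can be tried out on a small full-rank example. The sketch below is illustrative only: it uses the full-rank form $U_k^\dagger = (U_k^T U_k)^{-1} U_k^T$, does exact rational arithmetic with a small Gauss-Jordan routine rather than a numerical library, and all function names are hypothetical.

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Gauss-Jordan inverse over exact rationals; raises ValueError if A is singular."""
    n = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(A)]
    for c in range(n):
        pivot_row = next((r for r in range(c, n) if aug[r][c] != 0), None)
        if pivot_row is None:
            raise ValueError("matrix is singular")
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def spectral_block_weights(U, lambdas):
    """W_k = U_k Lambda_k (U_k^T U_k)^{-1} U_k^T for a full-rank b x M matrix U_k."""
    Ut = transpose(U)
    pinv = matmul(inverse(matmul(Ut, U)), Ut)  # M x b pseudo-inverse
    U_lam = [[u * lam for u, lam in zip(row, lambdas)] for row in U]  # U_k Lambda_k
    return matmul(U_lam, pinv)

# One block of size b = 4 with M = 2 generating sub-vectors as columns.
U = [[1, 1], [1, -1], [-1, 1], [1, 1]]
W = spectral_block_weights(U, [3, 3])
first = [row[0] for row in U]
image = [sum(w * x for w, x in zip(row, first)) for row in W]
```

In this example the first column of `U` is mapped by `W` to three times itself, i.e. each stored sub-vector is an eigenvector with its prescribed eigenvalue, and `W` is symmetric because the eigenvalues were chosen equal.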

Proposition 5.1 For each $k$, let $\lambda_{k1} = \cdots = \lambda_{kM} = \lambda_k > 0$, and assume that $M$ generating vectors are
chosen from a sequence of symmetric Bernoulli trials. Then with interconnectivities specified by the block
interconnectivity graph, $BG_n$, and weights by the spectral algorithm, the energy function, $E$, is non-increasing
along any trajectory in any mode of operation with probability approaching one asymptotically.

Proof: As before, $E(u) = \sum_{k=1}^{n/b} E_k(u_k)$, where $E_k$ is the energy function corresponding to block $I_k$.
Now each $U_k$ is full-rank with probability approaching one asymptotically as a consequence of a theorem of
Kahn, Komlós, and Szemerédi (see Appendix E). Thus, with high probability, each sub-matrix of weights,
$W_k$, is of the form $W_k = \lambda_k U_k (U_k^T U_k)^{-1} U_k^T$. Thus, $W_k$ is symmetric and its only eigenvalues are 0
and $\lambda_k > 0$, so that it is non-negative definite. Consequently, $E_k$ is non-increasing along any trajectory with high
probability, and hence, so is $E$. ∎
Theorem 5.2 If the block size satisfies $b/\log n \to \infty$ as $n \to \infty$ then, for block interconnectivity graphs
$BG_n$, block codes $BK_n^M$, and the spectral algorithm, the fixed point capacity is $b^{n/b}$.

Corollary 5.3 If, for fixed $t > 1$, the block size $b = n/t$, then under the conditions of theorem 5.2 the fixed
point capacity is $t^{-t} n^t$.

Corollary 5.4 If, for fixed $0 < t < 1$, the block size $b = n^t$, then under the conditions of theorem 5.2 the
fixed point capacity is $n^{t n^{1-t}}$.

In particular, slightly better capacities obtain for the same code sizes than for the outer-product algorithm.
The main technical problem here concerns the stability requirement that all the matrices $U_k$ be full rank.

6  CONCLUSIONS 

We have demonstrated that code size can be traded off for increased storage capacity, and that very large
capacities obtain for carefully selected codes which are still exponential in size. The analysis carried out here
for block sparsity where we have a uniform block size can be readily extended in obvious fashion for
non-uniform block sizes, where each block $I_k$ has a distinct block size $b_k$. For instance, if each $b_k = \Theta(n)$ then the
monotone $\rho$-attractor capacity for graphs $BG_n$ (with these block sizes), codes $BK_n^M$, and the outer-product
algorithm becomes $\prod_k (1 - 2\rho)^2 b_k / 2\log b_k n$. (The block code is again formed in the indicated manner by
mixing blocks.) Nesting blocks does not change the intrinsic capacity significantly; however, nesting may be
used as a vehicle for introducing "long range interactions" between distinct blocks of features.

A  feature  of  the  analysis  of  block  sparsity  in  this  paper  is  that  we  confer  a  probability  distribution  on 
admissible  memories  from  the  uniform  distribution  on  the  code.  The  probability  distribution  we  consider 
is,  hence,  on  ordered  multisets  of  memories  and  not  on  individual  memories;  in  particular,  the  distribution 
is  not  a  product  distribution  (except  in  simple  cases  such  as  the  complete  code).  This  may  be  seen  as  a 
limitation  in  the  technique  espoused  here.  We  do  not  currently  know  whether  it  is  possible  to  simultaneously 
achieve  large  capacities  and  code  sizes  to  take  advantage  of  specific  sparse  structures  with  memories  drawn 
from  a  suitable  product  distribution.  A  general  open  question  along  these  lines  is  the  design  of  codes  of  large 
size  and  achieving  large  capacity  for  any  given  (sparse)  interconnectivity  graph. 

A  Preliminary  Lemmas 

Consider a set of $N$ fully-interconnected neurons. Let $u^1, \ldots, u^M \in \mathbb{B}^N$ be an $M$-set of memories with
components drawn from a sequence of symmetric Bernoulli trials, i.e., the memory components are i.i.d.
with

$$P\{u_i^\alpha = -1\} = P\{u_i^\alpha = 1\} = 1/2, \qquad i = 1, \ldots, N, \quad \alpha = 1, \ldots, M.$$

Note that we are considering the complete code here and that the product distribution on memories above
induces a uniform distribution on admissible $M$-sets of memories in the code $CK_N^M$. The weights are specified
by the outer-product algorithm. Specifically,

$$w_{ij} = \sum_{\beta=1}^{M} u_i^\beta u_j^\beta.$$

Let $0 < \rho < 1/2$ be the desired attraction radius. Corresponding to each memory, $u^\alpha$, let $\hat{u}^\alpha$ denote a random
probe at mean distance $\rho n$ from $u^\alpha$ generated according to prescription (1). If each of the $M$ memories is to
be a monotone $\rho$-attractor then we will require that each of the $NM$ random sums

$$X_i^\alpha = u_i^\alpha \sum_{j=1}^{N} w_{ij} \hat{u}_j^\alpha, \qquad i = 1, \ldots, N, \quad \alpha = 1, \ldots, M,$$

Biswas,  Venkatesh 


10 


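be positive with high probability. Form the random variables

$$\zeta_{ij}^{\alpha\beta} = u_i^\alpha \hat{u}_j^\alpha u_i^\beta u_j^\beta, \qquad i, j = 1, \ldots, N, \quad \alpha, \beta = 1, \ldots, M. \qquad (4)$$

Substituting for the weights $w_{ij}$ we then have

$$X_i^\alpha = Y_i^\alpha + Z_i^\alpha, \qquad (5)$$

where

$$Y_i^\alpha = M u_i^\alpha \hat{u}_i^\alpha + \sum_{j \ne i} \zeta_{ij}^{\alpha\alpha} \qquad (6)$$

and

$$Z_i^\alpha = \sum_{j \ne i} \sum_{\beta \ne \alpha} \zeta_{ij}^{\alpha\beta}. \qquad (7)$$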


Let  us  begin  by  estimating  the  probability  that  a  particular  memory  component  is  not  retrieved  from  a 
random  probe  in  one  synchronous  step.  We  need  the  following  technical  result  on  large  deviations. 
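
Before turning to the asymptotic estimate, the component error probability can be explored by simulation. The sketch below is a rough illustration and not part of the analysis: it assumes that prescription (1), which is not reproduced in this section, flips each component of the memory independently with probability $\rho$; it keeps the diagonal weights $w_{ii} = M$; it counts a component as an error when the corresponding sum is not strictly positive; and the function names are hypothetical.

```python
import random

def outer_product_weights(memories):
    """w_ij = sum over memories of u_i * u_j (diagonal included)."""
    n = len(memories[0])
    return [[sum(u[i] * u[j] for u in memories) for j in range(n)] for i in range(n)]

def random_probe(memory, rho, rng):
    """Flip each component independently with probability rho (assumed form of prescription (1))."""
    return [-x if rng.random() < rho else x for x in memory]

def component_errors(W, memory, probe):
    """Count components i with X_i = u_i * sum_j w_ij * probe_j <= 0."""
    n = len(memory)
    return sum(1 for i in range(n)
               if memory[i] * sum(W[i][j] * probe[j] for j in range(n)) <= 0)

def error_rate(N, M, rho, trials, seed=0):
    """Empirical fraction of memory components not retrieved in one synchronous step."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        memories = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(M)]
        W = outer_product_weights(memories)
        for u in memories:
            errors += component_errors(W, u, random_probe(u, rho, rng))
    return errors / (trials * N * M)
```

Calling, say, `error_rate(64, 4, 0.1, 20)` gives an empirical estimate that can be compared informally with the asymptotic form derived below; with a single memory and $\rho = 0$ the rate is exactly zero.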

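Lemma A.1 Let $x_1 < x_2$ be any two real numbers and let $\{\zeta_j\}$ be a sequence of i.i.d. random variables
drawn from a sequence of Bernoulli trials with

$$\zeta_j = \begin{cases} x_1 & \text{with probability } q = 1 - p \\ x_2 & \text{with probability } p, \end{cases}$$

where $0 < p < 1$. For each $K$ let $S_K = \sum_{j=1}^{K} \zeta_j$. If as $K \to \infty$ the real number $v$ varies such that
$v/\sqrt{K} \to \infty$ and

$$v = \begin{cases} o(K^{2/3}) & \text{if } p \ne q \\ o(K^{3/4}) & \text{if } p = q = 1/2 \end{cases}$$

then

$$P\{S_K - K(p x_2 + q x_1) < -v(x_2 - x_1)\} \sim P\{S_K - K(p x_2 + q x_1) > v(x_2 - x_1)\} \sim \frac{\sqrt{Kpq}}{\sqrt{2\pi}\, v} \exp\left( -\frac{v^2}{2Kpq} \right).$$

The result is just a slight extension of the classical large deviation form of the DeMoivre-Laplace limit
theorem for sums of (0, 1) random variables.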

For notational simplicity denote $N_\rho = (1 - 2\rho)^2 N$. For $0 < \rho < 1/2$ define the function

f(p)  =  ((1  -  .  (8) 

Lemma  A. 2  If  M  satisfies  —*  oo  and  M  =  o(N)  as  N  —*  oo,  then 

Remark: The asymptotic form for component error above is somewhat different from the similar expression
derived in McEliece, et al. [1] as a consequence of differing modes adopted for the generation of random
probes.


Proof: Fix the component index $i$ and the memory index $\alpha$. Note that the random variables $\zeta_{ij}^{\alpha\beta}$ defined
in equation (4) are independent, so that the random variables $Y_i^\alpha$ and $Z_i^\alpha$ are independent and comprised
of sums of independent, $\pm 1$ random variables.

Fix $0 < \delta < 1/6$. For $j \ne i$, $\beta = \alpha$ the r.v.'s $\zeta_{ij}^{\alpha\alpha}$ are i.i.d. and take on values $-1$ and $1$ with probabilities
$\rho$ and $1 - \rho$, respectively. Let $\mathcal{E}^+$ denote the event that $\zeta_{ii}^{\alpha\alpha} = u_i^\alpha \hat{u}_i^\alpha = 1$. We then have by an application
of lemma A.1 that

p{|y;‘*-(i-2p)(fv-i)-A/|>fv‘/*+*|^+}  =  p  ||^C5“-(i-2/>)(Af-i) 

=  0(e-“^“).  (9) 

Conditioned  upon  E*  let  5'*'  denote  the  set  of  sample  points  for  which  the  following  bolds; 

|y.«  _  (1  _  2p)(N  _  1)  -  A/|  < 


We  then  have  from  equation  (9)  that 


P  =  1-0  • 

From  elementary  considerations  we  then  have  that 

P  {  A’?  <  0  I  =  P  {  Z?  <  ~Y^  \  €*}  =P  {Zf  <  -Y/*  I  5+,f+}  +  O  («■“''”)  • 

For  j  ^  t  and  0  ^  a  the  random  variables  are  i.i.d.  and  symmetric,  and  take  values  in  IB.  Hence,  Zf’  is 
just  a  symmetric  random  walk  over  (N  -  1)(M  -  1)  steps.  Also,  conditioned  upon  £'*'  we  have  that  within 
S*  the  r.v.  Y^  takes  on  values  whose  deviation  from  (1  — 2p)(W  — 1)  + A/  is  at  most  .  For  each  sample 

value  taken  by  (conditioned  upon  £*  and  5‘*’)  lemma  A.l,  hence,  applies  for  Af  as  in  the  statement  of 
the  lemma.  In  particular,  let  us  say  a  deviation,  y,  is  allowable  if  |y|  <  ,  and  let  us  denote  by  p(y) 

the  probability  that  y  is  allowable: 

p(y)  =  p  { y;*  =  (1  -  2p)(Ar  -  1)  +  A/  +  y  I  5+, } . 

For  each  allowable  y,  and  for  Af  as  in  the  statement  of  the  lemma,  we  then  have  by  lemma  A.l  that 

P  {Zr  <  -(1  -  2p)(N  -l)-M-y}^  exp  -  (1  -  2p)'j  . 

Hence, 

P{A‘'<0|^+}  =  ^  p(y)P{Z?<-(l-2p)(Af-l)-Af-y)  +  0(c-'‘^”) 

aJiowabiey 

As  S'*"  occurs  independently  with  probability  (1  —  p)  it  follows  that 

P  jx-  <  0,^*)  ~  -p  -  (1  -  w)  .  (10) 

In  similar  spirit  let  us  define  the  event  £~  that  =  u®u®  =  —  1,  and  conditioned  upon  S~  the  set  of 
sample  points  S~  for  which 

|y<®  -  (1  -  2p){N  -  1)  +  Af|  < 

P  {S~\€-}  =  1  -  C>  . 


As  before,  we  obtain 



Proceeding  in  similar  vein  we  can  now  demonstrate  that 

P  I T”  <  0,  r-  )  ~  ^  e,p  +  (1  -  2,))  .  (U) 

Combining  equations  (10)  and  (11)  and  equation  (8)  completes  the  proof  of  the  lemma.  I 

The next lemma concerns the joint distribution of $r$ sums, $X_{i_g}^{\alpha_g}$, $g = 1, \ldots, r$. The result is essentially
the main lemma demonstrated in McEliece, et al. [1], except for certain correction factors corresponding to
the terms $f(\rho)$ arising from a different generation of random probes. (The result we show below is actually
a slightly stronger version which appears in Venkatesh [12, pages 220-239].)

Lemma A.3 Let $r$ be any fixed, positive integer, and let $(i_g, \alpha_g) \in [N] \times [M]$, $g = 1, \ldots, r$ be distinct pairs
of integers. If $M > N^\sigma$ with $3/4 < \sigma < 1$, then, under the hypothesis of lemma A.2,

$$P\{X_{i_g}^{\alpha_g} < 0, \; g = 1, \ldots, r\} \sim q_1^r, \qquad (N \to \infty),$$

where $q_1$ denotes the single-component error probability of lemma A.2.

Essentially the same proof that appears in McEliece, et al. [1] or Venkatesh [12] can be modified to show
this result, care being taken to rigorously account for all correction terms. We will not go into it here.


B  Proof  of  Theorem  3.2 


Consider a set of $n$ neurons, interconnected according to a block interconnectivity graph with block size
$b$. Let $u^1, \ldots, u^m \in \mathbb{B}^n$ be an $m$-set of memories with components drawn from a sequence of symmetric
Bernoulli trials, and let the weights be specified by the outer-product algorithm. Note that, as before, we are
considering the complete code here and that the product distribution on memories above induces a uniform
distribution on admissible $m$-sets of memories in the code $CK_n^m$.

For the memories to be monotone $\rho$-attractors a necessary and sufficient condition is that the components
of the memories corresponding to each block be retrieved from a random probe. Let us consider block
$I_k$ for definiteness, and, as before, let $u_k^1, \ldots, u_k^m \in \mathbb{B}^b$ be the vectors of components of the memories
corresponding to block $I_k$. Now, within each block the components of a memory are updated independently
of the components in other blocks as a consequence of block sparsity. As the components of the memories
as well as the random probes are drawn independently, the results of appendix A apply here with $M = m$
and $N = b$. We will need the following version of the inclusion-exclusion principle.

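Lemma B.1 Let $E_1, \ldots, E_N$ be measurable subsets of a probability space. For $1 \le r \le N$, let $\sigma_r$ be the
sum of probabilities of all sets formed by intersecting $r$ of the events $E_1, \ldots, E_N$:

$$\sigma_r = \sum_{1 \le j_1 < j_2 < \cdots < j_r \le N} P\Big\{ \bigcap_{i=1}^{r} E_{j_i} \Big\}.$$

Then for every $K$, $1 \le K \le N/2$,

$$\sum_{r=1}^{2K} (-1)^{r-1} \sigma_r \;\le\; P\Big\{ \bigcup_{j=1}^{N} E_j \Big\} \;\le\; \sum_{r=1}^{2K-1} (-1)^{r-1} \sigma_r.$$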
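
The two-sided bound can be checked directly on a small finite probability space. The following sketch assumes a uniform measure on a ten-point sample space and an arbitrary illustrative choice of four events; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def sigma(events, r, size):
    """Sum over all r-subsets of the events of P(intersection), uniform measure on `size` points."""
    total = Fraction(0)
    for combo in combinations(events, r):
        common = set.intersection(*combo)
        total += Fraction(len(common), size)
    return total

def bonferroni_bounds(events, K, size):
    """Lower bound (alternating sum up to 2K) and upper bound (up to 2K - 1) on P(union)."""
    lower = sum((-1) ** (r - 1) * sigma(events, r, size) for r in range(1, 2 * K + 1))
    upper = sum((-1) ** (r - 1) * sigma(events, r, size) for r in range(1, 2 * K))
    return lower, upper

omega_size = 10
events = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5, 6}]
union_prob = Fraction(len(set.union(*events)), omega_size)
lower, upper = bonferroni_bounds(events, K=1, size=omega_size)
```

With $K = 1$ the bounds are $\sigma_1 - \sigma_2 \le P\{\bigcup_j E_j\} \le \sigma_1$; increasing $K$ tightens them, which is how the lemma is used with larger and larger fixed $K$ in the argument that follows.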


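For fixed $r$, let $\sigma_r$ denote the sum over the $\binom{mb}{r}$ distinct choices of $r$ memory components in block
$I_k$ of the probabilities that a distinct choice of $r$ memory components is not retrieved:

$$\sigma_r = \sum P\Big\{ \bigcap_{g=1}^{r} \{ X_{i_g}^{\alpha_g} < 0 \} \Big\},$$

where

$$X_i^\alpha = u_i^\alpha \sum_{j \in I_k} w_{ij} \hat{u}_j^\alpha = \sum_{j \in I_k} \sum_{\beta=1}^{m} u_i^\alpha u_i^\beta u_j^\beta \hat{u}_j^\alpha.$$

Lemma A.3 applies to this case, so that as $b \to \infty$ we have

$$\sigma_r \sim \frac{(mbq_1)^r}{r!}.$$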

Here  is  given  by  lemma  A. 2  as 


^5^5;  1  v2mjr 


with $f(\rho)$ as defined in equation (8) and $b_\rho = (1 - 2\rho)^2 b$. We can now apply lemma B.1 to estimate the
probability, $q_k$, that one or more memory components is not retrieved inside block $I_k$. Let us choose a rate of
growth for the number of memories, $m$, (as a function of the block size, $b$) small enough that the term $mbq_1$
is bounded. By choosing larger and larger (but fixed) sizes $K$ in lemma B.1, as $b \to \infty$ we can make both
the upper and lower bound on the probability $q_k$ approach arbitrarily close to $1 - e^{-mbq_1}$. Alternatively, the
probability, $p_k = 1 - q_k$, that, for each memory, all the components in the block are retrieved from a random
probe with parameter $0 < \rho < 1/2$ is given by $p_k \sim e^{-mbq_1}$. As there are no interconnections in between
blocks, the retrieval of memory components is independent across blocks. Hence, the probability of retrieval
of all the components of all the memories is $\prod_{k=1}^{n/b} p_k \sim e^{-mnq_1}$. We have thus established the following


Lemma B.2 With $q_1$ as given above, let $m$ increase slowly enough with $b$ so that $mbq_1$ remains bounded
as $b \to \infty$. Let $p(m, n, \rho)$ denote the probability that, for each memory, a random probe with parameter
$0 < \rho < 1/2$ is mapped into the memory in one synchronous step (i.e., all memory components are retrieved
from the probe in one synchronous step). Then, for any $0 < t < 1 < s$ we have

$$e^{-smnq_1} \le p(m, n, \rho) \le e^{-tmnq_1}, \qquad (b \to \infty).$$


Now, for any fixed choice of $\delta$ let $M$ be an integer sequence such that as $n \to \infty$


2  log  bn 


flog  log  6n -Hog 
logfrn 


Hoglog6n\ 
\  log’^n  / 


(12) 


It is easy to check that $mbq_1$ remains bounded as $n \to \infty$ if $m$ is chosen equal to $M$ for any fixed choice of
$\delta$. Lemma B.2 hence applies.

Now, for any $\lambda > 0$ (chosen arbitrarily small), consider a number of memories

$$m = \frac{(1 + \lambda)(1 - 2\rho)^2 b}{2 \log bn}.$$

For any choice of $\epsilon > 0$ fix $0 < t < 1$ and choose $\delta = -t^{-1} \log \epsilon$. Choose $M$ according to equation (12) for
such a choice of $\delta$. Using the upper bound for $p(m, n, \rho)$ in lemma B.2 with $m$ replaced by $M$ yields that for
such a choice of $M$, $p(M, n, \rho) \lesssim \epsilon$. Now it is clear that as $n \to \infty$ we will have $m \ge M$ whatever be the
choices of $\lambda$, $\epsilon$, and $t$. By uniformity, hence, $p(m, n, \rho) \le p(M, n, \rho) \lesssim \epsilon$. As $\epsilon$ can be chosen arbitrarily
small it follows that $(1 - 2\rho)^2 b / 2\log bn$ is an upper monotone $\rho$-attractor capacity.

Now again, for a choice of $\lambda > 0$ small, consider a number of memories

$$m = \frac{(1 - \lambda)(1 - 2\rho)^2 b}{2 \log bn}.$$

For any choice of $\epsilon > 0$ chosen arbitrarily small fix $s > 1$ and choose $\delta = -s^{-1} \log(1 - \epsilon)$. Choose $M$
according to equation (12) for such a choice of $\delta$. Now using the lower bound for $p(m, n, \rho)$ in lemma B.2
with $m$ replaced by $M$ yields that for such a choice of $M$, $p(M, n, \rho) \gtrsim 1 - \epsilon$. Now we have $m \le M$ as
$n \to \infty$ whatever be the choices of $\lambda$, $\epsilon$, and $s$. By uniformity, hence, $p(m, n, \rho) \ge p(M, n, \rho) \gtrsim 1 - \epsilon$.
As $\epsilon$ can be chosen arbitrarily small it follows that $(1 - 2\rho)^2 b / 2\log bn$ is also a lower monotone $\rho$-attractor
capacity. This concludes the proof of theorem 3.2. ∎
C  Proof  of  Theorem  3.5

Let $u^1, \ldots, u^M \in \mathbb{B}^n$ be a randomly chosen $M$-set of generating vectors with components drawn from a
sequence of symmetric Bernoulli trials. Corresponding to the $M$-set of generating vectors there is a unique
$M^{n/b}$-set of memories, $\bar{u}^1, \ldots, \bar{u}^{M^{n/b}} \in \mathbb{B}^n$, in the block code $BK_n^M$. Note that the product distribution
on generating vectors induces the uniform distribution on the block code.** For $\{i, j\} \in BG_n$ the weights
prescribed by the outer-product algorithm are given by

$$w_{ij} = \sum_{\beta=1}^{M^{n/b}} \bar{u}_i^\beta \bar{u}_j^\beta.$$

As $\beta$ runs through the indices 1 through $M^{n/b}$, for each of the $M$ generating vectors, $u^\alpha$, the corresponding
term $u_i^\alpha u_j^\alpha$ occurs $M^{n/b - 1}$ times in the sum above. (This follows from the construction of the block code:
each vector of components corresponding to block $I_k$ and generating vector $u^\alpha$ is used in the $k$-th block
of exactly $M^{n/b - 1}$ vectors in the generated $M^{n/b}$-set of memories in the block code.) Thus:

$$w_{ij} = M^{n/b - 1} \sum_{\alpha=1}^{M} u_i^\alpha u_j^\alpha, \qquad \{i, j\} \in BG_n. \qquad (13)$$

Scaling all the weights by the positive scale factor $M^{-(n/b - 1)}$ does not affect capacity. The situation is now
similar to that analysed earlier: the outer-product weights for the graph $BG_n$ are generated from a set of
vectors whose components are drawn from a sequence of symmetric Bernoulli trials.

Now we claim that the $M^{n/b}$-set of memories are monotone $\rho$-attractors iff each vector of components,
$u_k^\alpha$, corresponding to each block $I_k$, $k = 1, \ldots, n/b$, and each of the generating vectors, $u^\alpha$, $\alpha = 1, \ldots, M$,
is a monotone $\rho$-attractor.†† This follows because of the disjoint nature of the blocks, and the independent
assignment of components to the random probe. But by independence across the blocks this is equivalent to
requiring that each of the $M$ generating vectors be a monotone $\rho$-attractor for the matrix of weights specified
by equation (13). Lemma B.2 now applies directly. In particular, let the number of generating vectors, $M$,
be chosen as in equation (12). With a choice of $s > 1$, $\delta = -s^{-1} \log(1 - \epsilon)$ all the generating vectors are
monotone $\rho$-attractors, and hence so are the $M^{n/b}$ generated memories, with probability at least $1 - \epsilon$; with
a choice of $0 < t < 1$ and $\delta = -t^{-1} \log \epsilon$ some generating vectors fail to attract monotonically over a radius
$\rho$, and hence so do some of the $M^{n/b}$ generated memories, with probability at least $1 - \epsilon$.

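Consider first the case where the block size $b$ satisfies $n/b = o(\log bn / \log\log bn)$ as $n \to \infty$. Then

$$M^{n/b} = \left[ \frac{(1 - 2\rho)^2 b}{2 \log bn} \right]^{n/b} (1 + o(1)).$$

The choices $s > 1$, $\delta = -s^{-1} \log(1 - \epsilon)$, and $0 < t < 1$, $\delta = -t^{-1} \log \epsilon$ are both absorbed in the $o(1)$ term,
so that

$$\left[ \frac{(1 - 2\rho)^2 b}{2 \log bn} \right]^{n/b}$$

is both a lower and an upper monotone $\rho$-attractor capacity.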


**Herein lies the reason we defined codes in terms of ordered multisets of memories instead of sets of memories. We would like
to preserve a product distribution on memory components because, as we saw in the previous sections, this provides certain
amenities in analysis. This, however, corresponds to an urn model with replacement, and there is a non-zero (albeit small)
probability that the same memories are drawn again. If the code is defined in terms of sets of memories instead of ordered
multisets this results in a small, but annoying, non-uniformity in the distribution induced on the code.

††The utility of the random probe with parameter $\rho$ model is apparent here. The statement would not continue to hold in
toto if the probe were to be selected, for instance, randomly from the Hamming ball of radius $\rho n$ at the memory. The difficulty
is that $\rho n$ component errors in a memory need not translate into $\rho b$ errors in each block, and a disproportionate assignment
of component errors in any one block will cause non-convergence to the memory components in that block. A (provable) large
deviation limit theorem for the hypergeometric distribution is needed here.
Now consider the case where the block size $b$ satisfies $n/b = \Omega(\log bn / \log\log bn)$ and $b/\log n \to \infty$ as
$n \to \infty$. Define for any $\nu$

c„(u)  =  2^ 

For any fixed choice of $\delta$ it follows that $C_n(\nu) \le M^{n/b}$ if $\nu < 3/2$, while $C_n(\nu) \ge M^{n/b}$ if $\nu > 3/2$. Thus,
$C_n(\nu)$ is a lower monotone $\rho$-attractor capacity when $\nu < 3/2$ and an upper monotone $\rho$-attractor capacity when $\nu > 3/2$.
This concludes the proof of the theorem. ∎


D  Proof  of  Theorem  4.1 


Let $NG_n$ be a nested interconnectivity graph with block size $b$ and nesting depth $h$. As before, let
$u^1, \ldots, u^M \in \mathbb{B}^n$ be a random $M$-set of generating vectors (corresponding to a random $M^{n/b}$-set of
memories in the nested code). As before, the product distribution on the components of the generating vectors
induces the uniform distribution on the nested code. For nesting level 1 the situation is identical to that of block
sparsity. Specifically, if indices $i$ and $j$ lie in a common block at nesting level 1, i.e., $\{i,j\} \subseteq I_k$ for some
$k \in [n/b]$, then the corresponding weight $w_{ij}$ is as given by equation (13).

Let $i_k \in I_k$, $k = 1, \ldots, n/b$ denote the specified indices which comprise nesting level 2. Consider block
$I_k$. The probability that, for each memory, all $b$ components corresponding to this block are retrieved from
a random probe with parameter p in one synchronous step is certainly less than the probability that, for
each memory, the $b-1$ components corresponding to indices $j \in I_k \setminus \{i_k\}$ are retrieved. (Equality iff the
probability of retrieval of the $i_k$-th component is one.) But these $b-1$ indices are only interconnected with
other indices in the block $I_k$, so that the retrieval of the memory components corresponding to these $b-1$
indices in each block (sans the specified indices $i_k$) is a stochastically independent event across the blocks.
(The fact that there are interconnections across blocks through the specified indices, $i_k$, cannot affect the
other $b-1$ indices in each block in the first synchronous step, but only on later steps: in one synchronous
step, only the value of the $i_k$-th component contributes to the updates of the remaining $b-1$ components
corresponding to block $I_k$.) Consequently, we can again partition the problem into $n/b$ independent blocks.
It hence suffices to consider just the generating vectors rather than the vastly larger number of generated
memories in the code. Specifically, the requirement that $b-1$ components of each memory be retrieved in
each block is equivalent to just requiring that for each block the $b-1$ components (disregarding the specified
components $i_k$) of each generating vector are retrieved from a random probe with parameter p. Let $P$ denote
this probability. (This argument would not hold if we had to consider retrieval of the $i_k$-th components
as well because of the dependencies caused by the inter-block connections.) An argument similar to that
leading up to lemma B.2 now yields that for any $0 < t < 1$, and rate of growth of $M$ with $b$ such that
$M(b-1)q_1$ is bounded,


p(M,n,p)<P  <  exp  ^  =  exp  ^-tMnq[  ^1 -  , 

where,  with  the  appropriate  substitution  of  parameters  in  lemma  A. 2  we  have 

/(p)v/A7 


9i  = 


{-( 


With $M$ as defined in equation (12) for a choice of $0 < t < 1$ and $\delta = -t^{-1}\log\epsilon$, it follows that

$$p(M,n,p) < e^{-t\delta(1 - o(1))} = \epsilon + o(1).$$

The upper monotone p-attractor capacity estimates of theorem 3.5, hence, continue to hold for the nested
case.
To obtain lower capacity estimates, note that removing the $n/b$ neurons corresponding to the specified
indices results in a collapse of the nested structure into a block interconnectivity graph $BG_{n(b-1)/b}$ with
block size $b-1$. Now, an examination of the random variables, $X^{\alpha}$, shows that each added interconnection
weight specified by the outer-product algorithm improves the probability of retrieval of the corresponding
memory component. The connections from the indices $i_k$ within a block, hence, contribute a positive
probability to the probability of retrieval of any memory component within the block. Furthermore, we have
the uniformity property that the probability of retrieval of each component improves monotonically with
$n$ for fixed $M$. Hence, the probability $p(M,n,p)$ that, for each memory, all the components are retrieved
from a random probe with parameter p exceeds the probability, $P''$, that the corresponding vectors with
the specified $n/b$ components removed are retrieved in the graph $BG_{n(b-1)/b}$ with block size $b-1$. But,
here again we have an independent partition into $n/b$ blocks, and it suffices to estimate the probability that
the components of the corresponding generating vectors (with the specified $n/b$ components removed) are
retrieved. The argument leading to lemma B.2 again works and we have for any $s > 1$, and rate of growth
of $M$ with $b$ such that $M(b-1)q_1$ is bounded,

p{M,n,p)  >  P"  >  exp  — 1)  j  _  exp  ^-sMn^i  ^1  -  , 

where,  with  the  appropriate  substitution  of  parameters  in  lemma  A.2  we  have 

il-2p)^/2r{b-l)  V  2A/  )} 

With $M$ as defined in equation (12) for a choice of $s > 1$ and $\delta = -s^{-1}\log(1-\epsilon)$, it follows that when $b$ is
such that $b/\log n \to \infty$ as $n \to \infty$, then

$$p(M,n,p) > e^{-s\delta(1 - o(1))} = 1 - \epsilon + o(1).$$

The lower monotone p-attractor capacity estimates of theorem 3.5, hence, also continue to hold for the nested
case, and this concludes the proof. ∎

E  Proof  of  Theorem  5.2 

Consider the block interconnectivity graph $BG_n$ with block size $b$ and the corresponding block code. Let
$u^1, \ldots, u^M \in \mathbb{B}^n$ be a randomly selected $M$-set of generating vectors, as before, and let

$$U_k = [\, u_k^1 \;\; u_k^2 \;\; \cdots \;\; u_k^M \,]$$

be the corresponding $b \times M$ matrix of column vectors corresponding to the block $I_k$. To show that each
of the generated memories is stable it suffices to show that, for each generating vector, $u^p$, each of the $n/b$
vectors of components, $u_k^p$, is stable. If $U_k$ is full-rank then the sub-matrix of weights corresponding to block
$I_k$ is given by

$$W_k = U_k \Lambda_k (U_k^T U_k)^{-1} U_k^T, \qquad (14)$$

where $\Lambda_k$ is diagonal with positive diagonal terms $\lambda_{k1}, \ldots, \lambda_{kM}$. If the representation (14) holds (i.e., $U_k$ is
full-rank) then, for any generating vector, $u^p$, we have

$$W_k u_k^p = \lambda_{kp} u_k^p \mapsto u_k^p$$

as $\lambda_{kp} > 0$. The proof of the theorem will be complete if the probability that each of the matrices $U_k$ is
full-rank approaches one for large $n$. This follows as a consequence of the following new result due to Kahn,
Komlós, and Szemerédi [11].
Lemma E.1 Let $A_b$ be a random $b \times b$ binary matrix whose components are drawn from a sequence of
symmetric Bernoulli trials. Then there is a constant $1 < d < 2$ such that the probability that $A_b$ is singular
is no more than $d^{-b}$ asymptotically as $b \to \infty$.

Remark: The earlier (1977) estimate of Komlós of $b^{-1/2}$ for the probability that $A_b$ is singular does not
suffice for our purposes for block sizes $b = O(n^{2/3})$.

Using the above lemma, the probability that all the matrices $U_k$, $k = 1, \ldots, n/b$, are full-rank is at least
(1  —  which  asymptotically  tends  to  one  rather  quickly.  I 
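The stability argument built on representation (14) can be checked numerically for a single block. The following is a minimal sketch, not the authors' code: it assumes memory components take values in {-1, +1}, and the block size, number of generating vectors, eigenvalues, seed, and function names are all illustrative choices.

```python
import numpy as np

def spectral_weights(U, lam):
    """Return W = U diag(lam) (U^T U)^{-1} U^T for a full-column-rank U."""
    gram = U.T @ U  # M x M Gram matrix, invertible iff U has full column rank
    return U @ np.diag(lam) @ np.linalg.solve(gram, U.T)

def random_full_rank_block(b, M, rng):
    """Draw a b x M matrix of +/-1 entries, redrawing until it is full-rank."""
    while True:
        U = rng.choice([-1.0, 1.0], size=(b, M))
        if np.linalg.matrix_rank(U) == M:
            return U

rng = np.random.default_rng(0)        # arbitrary seed
b, M = 16, 4                          # illustrative block size and number of vectors
U = random_full_rank_block(b, M, rng)
lam = np.array([1.0, 2.0, 3.0, 4.0])  # any positive values will do
W = spectral_weights(U, lam)

# Each column of U is an eigenvector of W with a positive eigenvalue,
# so thresholding W u returns u itself, i.e. the block component is stable.
all_stable = all(
    np.array_equal(np.sign(W @ U[:, p]), U[:, p]) for p in range(M)
)
print(all_stable)
```

The redraw loop mirrors the role of Lemma E.1: the construction is only defined when the block matrix is full-rank, which the lemma shows happens with high probability.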
Areas in the brain that are believed to carry out the functions of
pattern separation, categorization, and associative memory exhibit
variable amounts of sparsed connectivity between neurons. In
Artificial Neural Networks, one of the parameters that
distinguishes one architecture from another is the connectivity of
the units. In feedforward layered networks, the connections are
asymmetrical, and the information flows from the input layer
through several hidden layers to a single output layer.
Backpropagation (BP) is a learning algorithm for feedforward layered
nets which (due mainly to the fact that it operates on networks that
contain hidden units) can discover useful internal representations
and thus can be a powerful tool for attacking difficult problems
like image and speech recognition. Several papers have reported
the correlation between the number of hidden units and the learning
ability of BP, but did not report any relationship between the
number of connections between the units and properties such as
learning times and memory capacities. In this paper I report results
of computer simulations of learning with the BP algorithm on layered
networks which differ in their connectivities. The results include a
comparison of learning rates between networks with different levels
of sparseness, and the percentage of examples needed to reach the
critical point at which a neural network is said to have learned the
function in a probably approximately correct (PAC) sense. These
results can serve as a proposal for more theoretical research
on the learning capacity of layered networks.




(2)  Introduction 


Neural nets have various parameters that influence their computational
characteristics, mainly the number of neurons, the number of hidden
neurons (in the case of layered nets), the activation functions (linear,
semi-linear, non-linear threshold), the deterministic or probabilistic
output decision rule, and the connectivity. In real biological networks,
sparseness of connections is very common (basically due to the physical
space limitation that makes fully connected networks with millions of
neurons impractical) and it is clear that sparsed networks do work.
Is it better or worse (and by which factor) to have sparsed networks?
This is not a trivial question. First we have to identify variables of
the neural network in question, and ask how sparseness relates to
them. In addition we have to divide the variables into two groups: the
first includes variables that are independent of the learning
algorithm that is used in teaching the neural net, and the second
includes the variables that are algorithm-dependent. One important
algorithm-dependent variable is the learning time (or
solution-convergence time). The most important algorithm-independent
variable is the capacity of a neural network. In a fully connected
network (like Hopfield, 1982) capacity is the number of vectors that
can be stored in a network and recalled reliably; in feedforward
layered nets, capacity reflects the number of functions (input-output
mappings) that can be realised by the network. The exact definition of
capacity also depends on how we want to define "reliably" in the
above. I will follow the definition of learnability-capacity, which is
best described by the following. In the formal definition of probably
approximately correct (PAC) learning, we start with some underlying
function, and pick examples at random from this very same function.
Specifically, we pick points in the input space randomly according to
some probability distribution and generate ordered pairs
(x_1, f(x_1)), ... , (x_M, f(x_M)) from the function f. The question
is how many examples we need to train the network on before we
can say that it has learned the function (L.G. Valiant, Nov 1984).
The answer is tied to the capacity of the network architecture (it
is proportional to, but larger than, the capacity). Note, however,
that we have assumed that the function f can be realised within
the network architecture (by architecture I mean the number of neurons
and the specific link connectivity between them); if it can't,
then the number of examples needed to be shown to the network
will never be finite since the network can't learn to begin with.

The results that I will present will compare the learning times,
specifically for the Backpropagation algorithm, on different sparsed
networks, and will also compare the number of examples that need
to be shown during the learning cycle for different sparsed networks
to achieve learnability. But before I present the results, I will give
an overview of experimental and theoretical results that have been
reported in the field.


Computational functions of the hippocampus

This section describes experimental results and system level theory of
hippocampal function; I will focus on the variety of sparsed real
neural-net architectures that this region of the brain has, and
explain how the different levels of sparseness contribute to different
important characteristics. The hippocampus is one of the oldest parts
of the brain. It gets inputs from many different areas of the cortex,
including the cerebral cortex, parietal cortex, the temporal lobe
visual and auditory areas, and the frontal cortex. Effects of damage
to the hippocampus show that very long term memories are not
influenced. Different experiments (Squire, 1986; Squire & Zola-Morgan,
1988) have repeatedly shown that the hippocampus plays a vital role
in the storage of declarative memories such as episodic memory and
semantic memory (hierarchy of facts). Within the hippocampus, there is
a three stage sequence of processing consisting of granule cells (which
receive from the entorhinal cortex), the CA3 pyramidal cells, and the
CA1 pyramidal cells. The CA3 area has an extensive recurrent
structure with a relatively large contact probability (4.3% in the
rat). It is believed that some sort of autoassociation memory
matrix (Kohonen, Oja, & Lehtio, 1981) is being represented there.
Kohonen (1972) has shown that the probability of a connection between
neurons must not be very low in order to maintain a large signal to
noise ratio in the retrieval of an output vector from an associative
matrix. This would explain why in the CA3 region the ratio of
the number of neurons to the number of their inter-connections is
relatively low, i.e. having fewer neurons allows high
connectivity, which improves the quality of the associations. In the
stage preceding the hippocampus the axons pass via a competitive
network. Its function is to accept non-orthogonal input vectors and
orthogonalize them; Kohonen (1972) showed that an associative network
has larger fidelity of recall for stored vectors which are closer to
being fully mutually orthogonal. Another preceding stage is the mossy
fibers system connecting dentate gyrus cells to the CA3 cells. This
system is characterized by very low link probability and is used to
decorrelate input patterns; in the rat, the probability of a link in
this region is 0.000078. Pattern separation can be achieved using this
type of sparsed network since the probability is very high that each
CA3 cell is influenced by a different assembly of dentate granule
cells.

Effect of Sparseness on Capacity in Hopfield-like Networks

Associative networks based on the architecture of the Hopfield
network have fully interconnected linear threshold units. Two popular
storage schemes are the outer-product method (J.J.
Hopfield, April 1982) and the spectral method for construction of
the weight matrix (Venkatesh/Psaltis, 1989). The relatively simple
construction of the linear transformation matrix (whose elements are
the weights of the connections between the neurons) by means of the
outer-product method yields a storage capacity of N/(4 log N) stable
memory states (where N is the number of neurons). The spectral
algorithm yields a storage capacity which is linear in N. However, in
contrast to the fully connected networks, nested neural networks
(nested means not fully connected and structured in a tree-like
architecture), which have the same characteristics as the Hopfield
network, have been shown to have a vastly greater number of stable
memory states than fully connected networks programmed in either of
the above schemes (Yoram Baram, March 1989). These types of networks
resemble the fractal forms studied by Mandelbrot (B. Mandelbrot,
1983), but differ in allowing connections not only between neighboring
layers but between all layers through several neurons shared in
common. All nested networks require considerably fewer connections than
fully connected ones. In one particular architecture having 1000
neurons divided into subnets each having 8 fully connected neurons, it
was shown that the number of stable memory states is

2  <  1006^-7^ )  =  ^  0294x10^'  (given  that  the  states  in  each  subnet  is 
selected to satisfy certain requirements assuring stability and error
correction). For randomly chosen vectors that are to be stored, it was
shown that the probability of picking a vector orthogonal to a vector
which is already stored is inversely proportional to the square root
of the number of neurons in a subnet (a subnet being the elementary
fully connected structure in the overall nested network). Since
orthogonal vectors satisfy the stability requirement for a memory
vector, in nested networks consisting of relatively small subnetworks
(i.e. subnets having a small number of fully connected neurons), the
orthogonality condition allows for the storage of stable vectors with
relatively high probability. The capacity in less than ideal
conditions (i.e. where the vectors stored in the subnets were only
nearly orthogonal) was also much higher compared to the fully connected
Hopfield network; with a nested network of 81 neurons divided into 10
subnetworks, each having 9 fully interconnected neurons, it was
possible to store 2048 stable vectors.
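As a point of comparison for the fully connected case, the outer-product rule and the N/(4 log N) capacity figure quoted above can be sketched as follows. This is an illustrative sketch only: it assumes +/-1 memory components, a zero diagonal in the weight matrix, and the natural logarithm in the capacity formula; the function names are hypothetical.

```python
import math
import numpy as np

def outer_product_weights(memories):
    """Sum of outer products of the stored +/-1 memories, with zero diagonal.

    `memories` is an (M, N) array holding one memory per row.
    """
    W = memories.T @ memories.astype(float)
    np.fill_diagonal(W, 0.0)  # no self-connections (an assumption of this sketch)
    return W

def is_stable(W, u):
    """A memory is stable if one synchronous threshold update leaves it unchanged."""
    return np.array_equal(np.where(W @ u >= 0, 1, -1), u)

def capacity_estimate(N):
    """Outer-product capacity N / (4 log N), using the natural logarithm."""
    return N / (4.0 * math.log(N))

u = np.array([1, -1, 1, 1, -1, -1, 1, -1, 1, 1])
W = outer_product_weights(u[None, :])
print(is_stable(W, u))            # a single stored memory is always stable
print(capacity_estimate(1000))    # roughly 36 memories for N = 1000
```

Under these assumptions a fully connected network of 1000 neurons stores on the order of a few dozen memories with the outer-product rule, which is the baseline against which the nested-network figures are being compared.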

Summary of existing experimental and theoretical results regarding the
effects of sparseness on neural networks

Fully interconnected networks are not realisable when there are vast
numbers of neurons, and it is evident in the brain that sparsed
networks do work, and achieve extremely large numbers of memory state
associations. In the above section, results of theoretical analysis
have shown that not only can sparsed networks function reliably, but
in certain architectures they improve the capacity over fully
connected networks with the same number of neurons. It would be
interesting to know the effect of sparsing feedforward layered
networks: will the number of realisable input-output mappings
increase? If not, then how is capacity related to sparseness?

The following sections will attempt to answer these questions.


(II) Experimental results


In all the following experiments, I constructed 3 layer networks where
sparsing was done only on the links that connect the input to the
hidden layer. The following setup was used in every trial of the
experiment:

(1) An original sparsed network is built according to a certain
architecture (architecture being the number of neurons and a
specific interconnection matrix. The main parameter that selects
a given architecture is the probability (P) for the existence
of a link connecting the input layer to the hidden layer); the
weights assigned to the links are picked randomly. Then several
binary vectors are placed individually at the input layer, and
the network generates a real valued response at the output
layer. The input-output pairs are then placed aside for use in
the next steps.

(2) A sparsed network with the same architecture (same P but
different random weights for the links) is built. An unsparsed
network is built (i.e. every neuron in the input layer connects
to all neurons in the hidden layer, and every neuron in the
hidden layer connects to all neurons in the output layer) with
random weight assignment.

(3) Teach the unsparsed and sparsed networks on the vectors
generated by the original sparsed net. Record the overall error
versus the number of learning cycles (I fixed the maximum number
of cycles). The reason that I generate vectors from an existing
network and teach them, instead of randomly picking a set of
input-output pairs, is that not every mapping can be realised
by the student networks; generating the vectors to be taught
with a network of the same architecture as the student network
guarantees that a solution exists for the mapping, and allows
me to measure other interesting characteristics of the student
network. As far as the unsparsed networks are concerned, I
expect them to be able to learn any mapping that a sparsed
network with the same number of neurons generated.
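The three setup steps above can be sketched as follows. This is a hedged illustration rather than the code used for the reported experiments: the sigmoid activation, the uniform weight range, the absence of bias terms, the batch update, the learning rate, the number of training vectors, and all function names are assumptions. The sparsed student is also assumed to reuse the teacher's link pattern, which is one reading of "same architecture".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_network(n_in, n_hid, n_out, p_link, rng, mask=None):
    """3-layer net; only input-to-hidden links are sparsed (kept with probability p_link)."""
    if mask is None:
        mask = (rng.random((n_hid, n_in)) < p_link).astype(float)
    w1 = rng.uniform(-1.0, 1.0, (n_hid, n_in)) * mask  # absent links have weight 0
    w2 = rng.uniform(-1.0, 1.0, (n_out, n_hid))        # hidden-to-output is fully connected
    return {"mask": mask, "w1": w1, "w2": w2}

def forward(net, X):
    H = sigmoid(X @ net["w1"].T)
    return H, sigmoid(H @ net["w2"].T)

def train(net, X, Y, lr, n_cycles):
    """Batch backpropagation on squared error; returns the error at each learning cycle."""
    errors = []
    for _ in range(n_cycles):
        H, O = forward(net, X)
        diff = O - Y
        errors.append(0.5 * float(np.sum(diff ** 2)))
        d_out = diff * O * (1.0 - O)
        d_hid = (d_out @ net["w2"]) * H * (1.0 - H)
        net["w2"] -= lr * d_out.T @ H
        net["w1"] -= lr * (d_hid.T @ X) * net["mask"]  # removed links stay removed
    return errors

rng = np.random.default_rng(1)
# Step (1): original sparsed network generates the input-output pairs.
teacher = make_network(20, 10, 2, p_link=0.3, rng=rng)
X = rng.integers(0, 2, size=(30, 20)).astype(float)   # 30 binary inputs (count assumed)
_, Y = forward(teacher, X)
# Step (2): a sparsed student and an unsparsed student with fresh random weights.
sparse_student = make_network(20, 10, 2, 0.3, rng, mask=teacher["mask"])
dense_student = make_network(20, 10, 2, 1.0, rng)
# Step (3): teach both and record error versus learning cycle.
err_sparse = train(sparse_student, X, Y, lr=0.01, n_cycles=200)
err_dense = train(dense_student, X, Y, lr=0.01, n_cycles=200)
```

Multiplying the input-to-hidden gradient by the mask is what keeps the sparsed student inside its own architecture during learning; without it, backpropagation would quietly reintroduce the removed links.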


Sparseness  versus  Learning  time 

I ran the above procedure on a 20-10-2 layered network (20 input
units, 10 hidden and 2 output units) with increasing values for the
probability of a link (P = 0.2, 0.3, 0.5, 0.7). From the Figure
series (A) the following are evident:

(1) When the level of sparseness of the generating network is high
(low P), both the unsparsed and sparsed networks learn very fast
(Figure (A.1)). The explanation is as follows: the training
vectors have very low variance (i.e. the outputs of the vectors
are very similar, within 1%); this is due to the very high
sparseness, which limits the output values to a narrow
region in state space, since no matter which input vector we
present, it will have the same effect as some input vector X
which has many zeros; therefore many different inputs will be
mapped to a vector X and will result in its output. So sparseness
decreased the range of the input-output transformation, and
there is now much less to learn (since all the vectors can be
considered as a few distinct vectors), and that is why whichever
network is learning will do well very quickly.

(2) The variance in the output of the vectors that are to be
taught increases as the sparseness of the generating network
decreases, and therefore it takes the student networks more time
to learn as the generating network becomes less sparse (since
there are more distinct vectors to learn). There are several
runs per figure, and one should look at the average behavior
when judging the results. Note that the Backpropagation
algorithm can get stuck in a local error minimum instead of
converging to the global error minimum, and that is why some of
the runs don't converge; this is the difficult part of trying to
analyze characteristics of neural networks by studying them
through an imperfect learning algorithm.

(3) On the average, sparsed networks learn faster than the
unsparsed networks. In Figures (A.2), (A.3) the unsparsed is
slower than the 0.2 sparsed by approximately 6%, in Figures
(A.4), (A.5) the unsparsed is slower by 8.3%, and in Figures
(A.6), (A.7) the unsparsed is slower by 16%.


Learning  the  8-5-8  Encoder  Problem 


Here the taught vectors were not generated by an original network, but
were the following fixed input-output pairs, in which each input
vector maps to itself:

              Input       Output

Vector (1):   10000000    10000000
Vector (2):   01000000    01000000
Vector (3):   00100000    00100000
Vector (4):   00010000    00010000
Vector (5):   00001000    00001000
Vector (6):   00000100    00000100
Vector (7):   00000010    00000010
Vector (8):   00000001    00000001
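The eight training pairs listed above are simply the rows of an 8 x 8 identity matrix used as both input and target. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def encoder_patterns(n=8):
    """Inputs and targets of the encoder problem: each one-hot vector maps to itself."""
    X = np.eye(n, dtype=int)
    return X, X.copy()

X, Y = encoder_patterns(8)
for i, (x, y) in enumerate(zip(X, Y), start=1):
    print(f"Vector({i}): {''.join(map(str, x))} -> {''.join(map(str, y))}")
```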

Eight different sparsed networks were built (each having 8 input
units, 5 hidden units and 8 output units). As Figure (B.1) shows, the
less sparseness, the faster the learning. The explanation is as follows:

Sparsed nets have smaller capacity, because removing weights decreases
the number of possible functions (mappings) that they can realise;
this is simply because there are fewer variables to permute (i.e. taking
the view that each function is a permutation of weights). One might
argue that since there are infinitely many real values that weights
may take, removing weights will not reduce the number of realisable
functions (since the remaining weights will still range over
infinitely many values). This is clearly wrong, since it is
not just the value of a link that characterizes a given function,
but also its position in the overall architecture, and there is a
finite number of positions that links can occupy (since there is a
finite number of neurons). For example, if enough links are removed to
isolate an output neuron, then although all the remaining links may
take infinitely many values, they still cannot influence the
isolated output neuron (therefore the number of realisable functions
has been reduced). For a given set of points (or vectors to be
learned) it is much more probable to find a function that contains
these points in its solution in a network that can realise more
functions to begin with (i.e. the unsparsed network), whereas in the
sparsed network the number of realisable functions is smaller, and
therefore the probability of finding a function (i.e. amongst the
small number of realisable functions) that has the points in its
solution set is much smaller. Therefore, the unsparsed network can,
with a higher probability than the sparsed network, learn a given
set of M vectors, and so its capacity is larger than that of the
sparsed network.

In Figure (B.1), the function that was to be taught was not realisable
by the very sparsed networks (as opposed to the previous section, which
dealt with teaching a realisable function to sparsed networks). That
is the reason that the very sparsed networks were not able to learn;
their capacity is too small. As we decreased the level of sparseness we
in effect increased the capacities, and thus we see an improvement in
the learning. This is because the function that we taught was
realisable with a greater probability in the less sparsed networks.
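One concrete way in which heavy sparsing can make the encoder mapping unrealisable is worth quantifying. If each input-to-hidden link exists independently with probability P (the independence is an assumption of this sketch), an input unit loses all of its links with probability (1 - P)^h for h hidden units. In a network without bias terms, two such disconnected input units produce identical hidden activity for their one-hot patterns, so the network cannot map them to different outputs. The function names below are hypothetical.

```python
def prob_input_isolated(p_link, n_hidden):
    """Probability that one input unit has no link to any hidden unit."""
    return (1.0 - p_link) ** n_hidden

def prob_at_least_two_isolated(p_link, n_inputs, n_hidden):
    """Probability that two or more input units are completely disconnected."""
    q = prob_input_isolated(p_link, n_hidden)
    none = (1.0 - q) ** n_inputs
    exactly_one = n_inputs * q * (1.0 - q) ** (n_inputs - 1)
    return 1.0 - none - exactly_one

# 8-5-8 encoder with a range of link probabilities
for p in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(p, round(prob_at_least_two_isolated(p, n_inputs=8, n_hidden=5), 4))
```

Under these assumptions, at P = 0.2 each input is disconnected with probability 0.8^5, about 0.33, so most randomly drawn 8-5-8 networks at that sparseness cannot represent the encoder mapping at all, while the effect vanishes as P approaches 1.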

In summary, the above two sections imply the following: a sparsed
network will learn faster than an unsparsed one when its architecture
can realise the function that it is trying to learn. It learns faster
because its architecture permits it to realise far fewer functions,
and when given M distinct points (vectors) it needs to sort out fewer
possible functions (in contrast to an unsparsed network) that contain
these points in their solution, and thus completes the sorting faster
than an unsparsed network. However, when given an unrealisable
function (or a function that is only nearly realisable) the sparsed
network either cannot learn or learns slower than the unsparsed
network (this follows from the same argument).


Experimental  results  on  capacity  of  feed-forward  layered  networks 

In all the following experiments, sparsing was done only on the links
that connect the input to the hidden layer. The following setup was
used in every trial of the experiment:

(1) An original sparsed network is built according to a certain
architecture (the main parameter that selects a given
architecture is the probability (P) for the existence of a link
connecting the input layer to the hidden layer); the weights
assigned to the links are picked randomly. Then several binary
vectors are placed individually at the input layer, and the
network generates a real valued response at the output layer.
The input-output pairs are then placed aside for use in the
next steps.

(2) A sparsed network with the same architecture (same P but
different random weights for the links) is built. An unsparsed
network is built.

(3) Teach the unsparsed and sparsed networks on the vectors
generated by the original sparsed net, showing a randomly
selected percentage of the vectors, and repeat the learning over
the range (20%, 90%). Record the overall error versus the
number of learning cycles (I fixed the maximum number of
cycles). The reason that I teach different percentages of the
points that define the taught function is that this will
determine the critical percentage point, which is (according to
PAC learning) proportional to the capacity of the system that is
doing the learning (in this case a neural network).


I started with a 5-4-2 architecture (Figure Series D). I built a 0.2
probability original sparsed network with which I generated 32
binary-input real-output vectors. I built a 5-4-2 sparsed network with
the same connectivity and a 5-4-2 unsparsed network and trained them
on 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% of the above generated
vectors (averaged over 6 runs per trial; e.g. in the 50% trial the
number of vectors shown is 16, but the set of 16 vectors shown can
differ from run to run). I recorded the resulting Error versus Number of
learning cycles versus Percentage shown. I repeated the above for
0.4, 0.5, 0.6, 0.8, 0.9 probability generating sparsed networks.
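The "percentage shown" part of this procedure can be sketched as below: enumerate all 2^5 = 32 binary inputs of a 5-input network, then draw a random subset of the requested size without replacement for each run. The function names are hypothetical, and the zero-filled target array stands in for the teacher network's real-valued outputs.

```python
import itertools
import numpy as np

def all_binary_inputs(n_in):
    """Every binary input vector of length n_in, one per row (2**n_in rows)."""
    return np.array(list(itertools.product([0, 1], repeat=n_in)), dtype=float)

def sample_fraction(X, Y, fraction, rng):
    """Pick a random subset containing the given fraction of the (input, output) pairs."""
    n_show = int(round(fraction * len(X)))
    idx = rng.choice(len(X), size=n_show, replace=False)
    return X[idx], Y[idx]

rng = np.random.default_rng(0)
X = all_binary_inputs(5)
Y = np.zeros((len(X), 2))          # placeholder for the teacher's outputs
X_shown, Y_shown = sample_fraction(X, Y, 0.5, rng)
print(len(X), len(X_shown))        # 32 vectors in total, 16 shown in the 50% trial
```

Calling `sample_fraction` again with the same fraction gives a different set of 16 vectors, which corresponds to the remark that the vectors shown can differ between the 6 runs of a trial.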

When comparing the critical points for the unsparsed network with that
of the corresponding sparsed network that learned the vectors
generated by a network of the same architecture, we see that the
unsparsed critical point is always higher than the sparsed one (as we
expect). In Figures (D.6) and (D.7), I ran 0.5 sparsed networks that
learned 4 different sets of 0.5-sparsed-generated vectors; it is
evident that in all 4 cases, the critical point is approximately the
same (50% shown). This result goes nicely with the expectation that
the capacity (and thus the critical point which is proportional to it)
is independent of the problem learned. Figure (E.1) shows the Error vs
Percentage-Shown at the 700th learning cycle. The point where the
error starts to follow the asymptote (i.e. where the derivative is
relatively small) is the critical point. The relationship of
sparseness to critical point for the 5-4-2 is as follows:

Probability of a link    Critical point

0.2                      30%
0.4                      40%
0.5                      50%
0.6                      50%
0.8                      60%
0.9                      70%
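
The rule "the critical point is where the error starts to follow the asymptote" can be made concrete in several ways. The sketch below is one hypothetical reading, not the procedure used in the report: it returns the first percentage after which every finite-difference slope of the error curve stays below a tolerance `tol` (an assumed parameter). The error values are made up for illustration.

```python
def critical_point(percent_shown, error, tol=0.05):
    """First percentage after which the error curve stays flat.

    tol is an assumed bound on |d(error)/d(percent)|; the report only says
    the derivative should be "relatively small".
    """
    slopes = [
        abs(error[i + 1] - error[i]) / (percent_shown[i + 1] - percent_shown[i])
        for i in range(len(error) - 1)
    ]
    for i in range(len(slopes)):
        if all(s <= tol for s in slopes[i:]):
            return percent_shown[i]
    return percent_shown[-1]

# Made-up error curve for illustration only (not data from the report).
percent = [20, 30, 40, 50, 60, 70, 80, 90]
err = [9.0, 6.0, 3.5, 1.0, 0.8, 0.7, 0.7, 0.6]
print(critical_point(percent, err))
```

The choice of `tol` shifts the reported critical point, which is one reason the tabulated values should be read as approximate.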

I repeated the above with a 6-4-1 architecture (Figure Series F);
there were 2^6 = 64 generated vectors to be learned. I only display
one sparsed network (0.5 sparsed) as an example (Figure F). In
reality, 6 different sparsed networks were built for each percentage
(20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%) of each sparsed case (i.e.
for 0.2, 0.4, 0.5, 0.6, 0.8, 0.9). In total, 288 networks were built
in this experiment. Then I averaged the 6 responses per specific
percentage, and plotted the graphs (Figures F.1 - F.4). As seen in
Figure Series (F), the relationship of sparseness to critical point
for the 6-4-1 is as follows:

Probability of a link    Critical point

0.2                      40%
0.4                      50%
0.5                      50%
0.6                      35%
0.8                      45%
0.9                      60%

I repeated the above with a 7-7-1 architecture (Figure Series G);
there were 2^7 = 128 generated vectors to be learned. In total, 288
networks were built in this experiment. Then I averaged the 6
responses per specific percentage, and plotted the graphs (Figures
G.1 - G.4). As seen, the relationship of sparseness to critical point
for the 7-7-1 network is as follows:

Probability  of  a  link  Critical  Point; 
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407. 

0.4 

50% 

0.5 

407. 

0.6 
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0.8 
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0.9 
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Figure  (K.l)  shows  the  results  of  the  critical  point  measurements  for 
the  5-4-2,  6-4-1,  7-7-1  architectures. 

In the above, each network was learning a problem of different size,
i.e. the 5-4-2 learned a function that had 32 input-output vectors,
the 6-4-1 learned a function that had 64 input-output vectors, and the
7-7-1 learned a function that had 128 input-output vectors. In order
to compare networks of different architectures that learn the same
size problem, I ran the following experiment. I repeated the above
3-step procedure with an 8-8-1 and a 10-10-1 network learning a set of
vectors that was generated by a 6-4-1 network, i.e. the problem had 64
input-output vectors. In total, 576 networks were built in this
experiment. The results are shown in Figure Series (H) and Figure
Series (J). The relationship of sparseness to critical point in the
8-8-1 network is as follows:


Probability of a link    Critical point

0.2                      30%
0.4                      40%
0.5                      50%
0.6                      50%
0.8                      60%
0.9                      70%

And  for  the  10-10-1  network: 


Probability of a link    Critical point

0.2                      40%
0.4                      50%
0.5                      60%
0.6                      67%
0.8                      70%
0.9                      85%

Figure (K.2) shows the results of the critical point measurements for
the 6-4-1, 8-8-1, 10-10-1 architectures. Figure (K.3) shows the
critical points as a function of the maximum number of links.

Section  Summary 

The following are evident from Figure Series (K):

(1) As sparseness decreases (increasing probability of a link
beyond 0.5) the critical-point curves branch out from each
other. This may suggest that the capacity is a linear function
of sparseness with a slope determined by the number of units (the
specific architecture) in the network.

(2) The plot (Figure K.2) of the networks that learned the same
size problem was as expected, since the smaller network (6-4-1)
should have a critical-point curve lower than the 8-8-1 (which
is evident in the plot). But the plots of Figure (K.1) are not
as trivial. I would expect the 5-4-2 network to have a
critical-point curve lower than the 6-4-1 (which is not evident).
This suggests that there may be some non-linear relationship
between a network of a specific architecture and its ability to
learn problems of different sizes, thus allowing a 6-4-1 network
to learn a set of 64 vectors with a smaller percentage of examples
than a 5-4-2 network needs to learn a set of 32 vectors, and thus
giving the false impression that the 6-4-1 has less capacity. In
other words, when using an algorithm to measure the critical
error, we must make sure that the initial conditions (in this case
the problems being learned) are as uniform (i.e. of the same size)
as possible across all the student networks.
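
Observation (1) can be checked roughly against the two same-size-problem tables above (8-8-1 and 10-10-1). The sketch below fits a least-squares line to critical point versus link probability for each architecture. It is only an illustration on six points per curve, not a statistical claim, and the helper name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares fit."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

link_prob = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
critical = {                                  # critical points in percent
    "8-8-1": [30, 40, 50, 50, 60, 70],
    "10-10-1": [40, 50, 60, 67, 70, 85],
}
slopes = {name: least_squares_line(link_prob, ys)[0]
          for name, ys in critical.items()}
for name, slope in slopes.items():
    print(name, round(slope, 1))
```

On these values the larger network has the steeper fitted line, which is consistent with the suggestion that the slope grows with the number of units.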


(4)  Conclusions 

This work was concerned with getting qualitative insight into
the relationship of learning speed versus sparseness, and
relative capacities versus sparseness. The experiments were
carried out as computer simulations, and the majority of the
results were as expected. The main points suggested by the work
are summarized below:

- A sparsed network will learn faster than an unsparsed net (having
the same number of neurons) when its architecture can realise
the function that it is trying to learn. When given an
unrealisable function (or a function that is only nearly
realisable) the sparsed network either cannot learn or learns
slower than the unsparsed network.

- The capacity is approximately a linear function of sparseness with
a slope determined by the number of units (the specific
architecture) in the network. The less sparsed the network, the
greater the capacity, and thus it is able to learn more types of
input-output mappings, i.e. realise a bigger set of possible
functions.
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The  focus  of  the  paper  is  the  estimation  of  the  maximum  number  of  states  that 
can be made stable in higher-order extensions of neural network models. Each
higher-order neuron in a network of $n$ elements is modeled as a polynomial
threshold element of degree $d$. It is shown that regardless of the manner of
operation, or the algorithm used, the storage capacity of the higher-order
network is of the order of one bit per interaction weight. In particular, the
maximal (algorithm independent) storage capacity realizable in a recurrent
network of $n$ higher-order neurons of degree $d$ is of the order of $n^d/d!$.
A generalization of a spectral algorithm for information storage is introduced
and arguments adducing near optimal capacity for the algorithm are presented.
© 1991 Academic Press, Inc.


1.  Introduction 

A  formal  neuron  (after  McCulloch  and  Pitts,  1943)  is  defined  as  a  linear 
threshold  element  which  accepts  n  inputs  and  computes  a  binary  output 
based  on  the  sign  of  a  linear  form  of  the  inputs.  When  n  such  elements  are 

* Presented in part at the IEEE Conference on Neural Information Processing
Systems, Denver, Colorado, November, 1987, and at the IEEE International
Symposium on Information Theory, Kobe, Japan, June, 1988.
† Corresponding author.

‡ Also Division of Biology, California Institute of Technology, Pasadena, CA 91125.
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interconnected  with  the  output  of  each  neuron  serving  as  input  to  all  the 
neurons  in  the  network,  a  closed  feedback  system  results  with  dynamics 
described  by  trajectories  on  the  vertices  of  the  n-cube.  Each  vertex  de¬ 
fines  a  possible  state  of  the  recurrent  network,  and  we  identify  the  vector 
of  neural  outputs  as  the  (instantaneous)  state  of  the  system.  The  fixed 
points  (or  stable  states)  of  such  recurrent  networks  are  of  importance  in 
their  computational  characterization;  in  particular,  we  are  interested  in 
the  following  question:  What  is  the  maximum  number  of  arbitrarily  speci¬ 
fied  vertices  that  can  be  made  stable  in  a  recurrent  neural  network  by 
suitable  selection  of  neural  interconnectivity? 

In this paper we focus on recurrent networks where the computational
elements are higher-order extensions of the basic linear threshold neural
model. Each higher-order neuron is a polynomial threshold element of a
given degree $d$. If, in a recurrent network of $n$ higher-order neurons, the
current outputs (states) of the neurons are $u_1, \dots, u_n \in \{-1, 1\}$,
then an update, $u_i'$, of the state of the $i$th neuron is given by the sign
of an algebraic form

$$u_i' = \operatorname{sgn}\Bigl(\sum_{i_1, \dots, i_d} w_{i i_1 \cdots i_d}\, u_{i_1} \cdots u_{i_d}\Bigr). \tag{1}$$

The number of degrees of freedom in choosing the interaction coefficients
(or weights) is increased to $n^{d+1}$ from the $n^2$ weights for the case of
linear interactions. The added degrees of freedom in the interaction
coefficients can potentially result in enhanced flexibility and programming
capability over the linear case as has been noted independently by several
authors (Lee et al., 1986; Psaltis and Park, 1986; Baldi and Venkatesh,
1987, 1988).
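
As a concrete reading of the update in Eq. (1), the following minimal sketch evaluates the degree-$d$ form for one neuron by brute force. The dictionary representation of the weights, the function names, and the convention that a sum of exactly zero maps to $+1$ are all assumptions made for illustration.

```python
import itertools

def sgn(x):
    return 1 if x >= 0 else -1      # tie at 0 resolved to +1 by assumption

def higher_order_update(weights_i, u, d):
    """Sign of sum over (i1..id) of w[i1..id] * u[i1] * ... * u[id].

    weights_i maps index tuples of length d to the weights of one neuron;
    missing tuples are treated as zero weights.
    """
    total = 0.0
    for idx in itertools.product(range(len(u)), repeat=d):
        prod = 1
        for j in idx:
            prod *= u[j]
        total += weights_i.get(idx, 0.0) * prod
    return sgn(total)

def num_weights(n, d):
    """Degrees of freedom of the whole network: n neurons, n**d weights each."""
    return n ** (d + 1)

state = [1, -1, 1]
w_neuron = {(0, 1): 1.0, (1, 2): -2.0}       # a sparse degree-2 example
new_bit = higher_order_update(w_neuron, state, d=2)
```

For $d = 1$, `num_weights` gives the familiar $n^2$ weights of the linear case.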

We rigorously estimate the storage capacity of recurrent higher-order
neural networks: specifically, we calculate the maximum number of
arbitrarily specified vectors that can be made stable in a recurrent network
of $n$ polynomial threshold units of degree $d$.¹ All our results point in the
following direction.

Regardless of the manner of operation, or the algorithm utilized, the
storage capacity of a higher-order network of degree $d$ is of the order of 1
memory bit per interaction coefficient. And in particular:

• The storage capacity of the outer-product algorithm generalized to
networks of degree $d$ is of the order of $n^d/\log n$ memories (with constants
depending on the variant employed);
¹ Cases where networks have random interaction coefficients (instead of the programmed
scenario here) lead to entirely different computational issues. We deal with these in a
concurrent paper (Venkatesh and Baldi, 1989).
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•  The  maximal  (algorithm  independent)  storage  capacity  realizable 
in a higher-order neural network of degree $d$ is of the order of $n^d/d!$;

• Near optimal storage capacities of the order of $n^d/d!$ memories can
be obtained by variants of the spectral algorithm.

In this paper we set up the basic definitions in Section 2, construct a
spectral based algorithm with near optimal capacity in Section 3, and
rigorously estimate the maximal (algorithm independent) capacity of a
network of given degree in Section 4. In a concurrent paper we include
the capacity calculations for the outer-product algorithm generalized to
degree $d$ (Venkatesh and Baldi, 1991).

Notation. Let $\{x_n\}$ and $\{y_n\}$ be positive sequences. We use the
following standard asymptotic notation:

1. $x_n = O(y_n)$ if there is a positive constant $L$ such that $x_n/y_n \le L$
for all $n$;

2. $x_n \sim y_n$ if $x_n/y_n \to 1$ as $n \to \infty$;

3. $x_n = o(y_n)$ if $x_n/y_n \to 0$ as $n \to \infty$.

By almost all we mean all but an asymptotically negligible subset:
specifically, if $A_n$ denotes a sequence of finite sets, and $\mathcal{P}$ is
some attribute, we say that almost all elements of $A_n$ exhibit $\mathcal{P}$
if the subsets $B_n \subseteq A_n$ for which $\mathcal{P}$ holds are such that
$|B_n| \sim |A_n|$ as $n \to \infty$. We denote by $\mathbb{B}$ the set
$\{-1, 1\}$, and by $[n]$ the set of indices $\{1, 2, \dots, n\}$ for any
positive integer $n$. Finally, by an ordered multiset we mean an ordered
collection of elements where repetition is allowed.


2.  Higher-Order  Neural  Networks 
2.1.  Polynomial  Threshold  Units 

We consider recurrent networks of polynomial threshold units each of
which yields an instantaneous state of $-1$ or $+1$. More formally, for
positive integers $n$ and $d$, let $\mathcal{I}_d$ be the set of ordered
multisets of cardinality $d$ of the set $[n]$. Clearly $|\mathcal{I}_d| = n^d$.
For any subset $I$ of $[n]$, and for every
$u = (u_1\ u_2\ \cdots\ u_n) \in \mathbb{B}^n$, set $u_I = \prod_{i \in I} u_i$.

Definition 2.1. A fully interconnected higher-order neural network
of degree $d$ is characterized by a set of real weights $w_{(i,I)}$ indexed by
the ordered pair $(i, I)$ with $i \in [n]$ and $I \in \mathcal{I}_d$, and a real
margin of operation $\mathcal{B} \ge 0$. The network dynamics are described by
trajectories in a state space of binary $n$-tuples, $\mathbb{B}^n$: for any
state $u \in \mathbb{B}^n$ on a trajectory, a component update
$u_i \mapsto u_i'$ is permissible iff

$$u_i' = \begin{cases}
-1 & \text{if } \sum_{I \in \mathcal{I}_d} w_{(i,I)} u_I < -\mathcal{B}, \\
-u_i & \text{if } -\mathcal{B} \le \sum_{I \in \mathcal{I}_d} w_{(i,I)} u_I \le \mathcal{B}, \\
1 & \text{if } \sum_{I \in \mathcal{I}_d} w_{(i,I)} u_I > \mathcal{B}.
\end{cases} \tag{2}$$


The  evolution  may  be  synchronous  with  all  components  of  u  being  up¬ 
dated  according  to  the  rule  (2)  at  each  epoch,  or  asynchronous  with  at 
most  one  component  being  updated  per  epoch  according  to  Eq.  (2). 

The network is said to be symmetric if $w_{(i,I)} = w_{(j,J)}$ whenever the
$(d + 1)$-tuples of indices $(i, I)$ and $(j, J)$ are permutations of each other.
The network is said to be zero-diagonal if $w_{(i,I)} = 0$ whenever any index
repeats in $(i, I)$.

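Let $\mathcal{S}_d$ denote the set of all subsets of $d$ elements from $[n]$;
$|\mathcal{S}_d| = \binom{n}{d}$. Combining all redundant terms in Eq. (2), for
symmetric, zero-diagonal networks a component update $u_i \mapsto u_i'$ is
permissible iff

$$u_i' = \begin{cases}
-1 & \text{if } \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I < -\mathcal{B}, \\
-u_i & \text{if } -\mathcal{B} \le \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I \le \mathcal{B}, \\
1 & \text{if } \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I > \mathcal{B}.
\end{cases} \tag{3}$$

(If the network is symmetric and zero-diagonal then, for each nonzero
coefficient, i.e., each coefficient $w_{(i,I)}$ for which no index repeats in
$(i, I)$, the term $w_{(i,I)} u_I$ occurs $d!$ times in the sum
$\sum_{I \in \mathcal{I}_d} w_{(i,I)} u_I$. Hence,
$\sum_{I \in \mathcal{I}_d} w_{(i,I)} u_I = d! \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I$.
The constant scale factor $d!$ is removed in Eq. (3) as this is just equivalent
to scaling the margin.)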

The choice of margin of operation essentially specifies the "strength"
of the desired interaction. A choice of margin $\mathcal{B} = 0$ leads to
standard threshold operation. For a choice of nonzero margin of operation, a
bit, $u_i$, retains its sign if and only if the corresponding weighted sum
multiplied by $u_i$ exceeds $\mathcal{B}$; otherwise its sign is reversed.
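
The margin behaviour just described can be sketched for a single component as follows. The function names are hypothetical, and `field` stands for the weighted sum seen by neuron $i$; the middle branch (reversal inside the margin band) follows the sentence above rather than any additional source.

```python
def margin_update(u_i, field, margin):
    """One component update under a margin of operation.

    field is the weighted sum for neuron i; margin >= 0.
    """
    if field > margin:
        return 1
    if field < -margin:
        return -1
    return -u_i          # inside the margin band the bit is reversed

def retains_sign(u_i, field, margin):
    return margin_update(u_i, field, margin) == u_i

# A bit is retained exactly when u_i * field exceeds the margin.
checks = [
    retains_sign(u_i, field, margin) == (u_i * field > margin)
    for u_i in (-1, 1)
    for field in (-2, -1, -0.5, 0, 0.5, 1, 2)
    for margin in (0, 0.5, 1)
]
```

With `margin = 0` the rule reduces to ordinary sign thresholding, except that a field of exactly zero flips the bit.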

These networks are seen to be natural generalizations to higher-order of
the familiar case of linear threshold networks ($d = 1$). While networks of
polynomial threshold units require more computationally powerful units
than linear threshold functions, each polynomial threshold element
(subscribing to rule (2) or to rule (3)) can be replaced by a small, equivalent
network of linear threshold units. To see this note that it suffices to be
able to realize each individual product of components,
$u_I = \prod_{j=1}^{d} u_{i_j}$, for each choice of
$I = (i_1, i_2, \dots, i_d) \in \mathcal{I}_d$, as the results of all these
computations can be combined with a single linear threshold gate to realize
the desired output. Now, for each $I \in \mathcal{I}_d$, realizing the product
of components $u_I$ is equivalent to checking the parity of the $d$ bits
$u_{i_1}, u_{i_2}, \dots, u_{i_d}$ in the product. It suffices, hence, to show
that parity can be computed by small circuits of linear threshold units. But
this, in fact, is a special case of a more general known result that any
symmetric Boolean function (i.e., a function which is invariant under any
permutation of the inputs, parity being an example) can be computed by small
circuits of linear threshold elements. For completeness, we sketch a short
proof of this result below.

Proposition 2.2. Any symmetric Boolean function on $d$ variables can
be computed by a linear threshold circuit of depth two and linear size; in
particular, $d$ threshold elements in the first layer and a single output
threshold element in the second layer are always sufficient.

Proof. The proof is constructive. Array the $2^d$ possible inputs of $\pm 1$
$d$-tuples in $(d + 1)$ rows with the elements in each row being permutations
(i.e., all $d$-tuples in a row have the same number of $+1$'s), the lowest row
containing the single $d$-tuple which has no $+1$'s, and with the number of
$+1$'s increasing monotonically with the rows to the final $(d + 1)$th row
which contains the single $d$-tuple whose components are all $+1$. Any
symmetric Boolean function clearly assumes the same value for all
elements (Boolean $d$-tuples) in a row. Hence, for any given symmetric
function, contiguous rows where the function assumes the value $+1$ form
bands which are separated by contiguous rows where the function
assumes the value $-1$. This is illustrated schematically in Fig. 1a. Now
assume there are $b$ bands where the function assumes the value $+1$.
(There are at most $d/2$ such bands, the worst case occurring for the
parity function.) The function can now be computed by a circuit with $2b$
linear threshold elements in the first layer and a single linear threshold
element in the second layer as illustrated in Fig. 1b. (Each linear threshold
unit produces a $+1$ if the weighted sum of all its inputs exceeds its
threshold, and produces a $-1$ otherwise.) ■
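
The construction in the proof can be sketched in code. The particular choice of weights and thresholds below is one possible realization and is an assumption, since the proof only refers to Fig. 1b: each band $[lo, hi]$ of counts of $+1$ inputs gets one unit testing "count at least $lo$" and one testing "count at most $hi$", so a pair contributes 2 inside its band and 0 outside it.

```python
import itertools

def threshold_unit(weights, threshold, inputs):
    """+1 if the weighted input sum exceeds the threshold, else -1."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else -1

def bands_of(f_by_count):
    """Maximal runs [lo, hi] of counts s (number of +1 inputs) with f = +1."""
    bands, start = [], None
    for s, val in enumerate(f_by_count):
        if val == 1 and start is None:
            start = s
        if val != 1 and start is not None:
            bands.append((start, s - 1))
            start = None
    if start is not None:
        bands.append((start, len(f_by_count) - 1))
    return bands

def symmetric_circuit(f_by_count, u):
    """Depth-two circuit: 2b first-layer units, one output unit."""
    d = len(u)
    first_layer = []
    for lo, hi in bands_of(f_by_count):
        # sum(u) = 2s - d, so "s >= lo" is "sum(u) > 2*lo - d - 1".
        first_layer.append(threshold_unit([1] * d, 2 * lo - d - 1, u))
        # "s <= hi" is "-sum(u) > d - 2*hi - 1".
        first_layer.append(threshold_unit([-1] * d, d - 2 * hi - 1, u))
    # The first-layer sum is 2 inside a band and 0 outside every band.
    return threshold_unit([1] * len(first_layer), 1, first_layer)

d = 5
parity = [(-1) ** (d - s) for s in range(d + 1)]   # product of the d inputs
agree = all(
    symmetric_circuit(parity, u) == u[0] * u[1] * u[2] * u[3] * u[4]
    for u in itertools.product([-1, 1], repeat=d)
)
```

The table `parity` lists the function value for each possible count of $+1$ inputs, which is all a symmetric function can depend on.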

2.2.  Capacity 

As  in  any  dynamical  system,  the  fixed  points  are  important  in  the  char¬ 
acterization  of  the  system  dynamics. 

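Definition 2.3. Let $\mathcal{B} \ge 0$ be fixed. A state $u \in \mathbb{B}^n$
of a fully interconnected network is said to be $\mathcal{B}$-stable iff

$$u_i \sum_{I \in \mathcal{I}_d} w_{(i,I)} u_I > \mathcal{B}, \qquad i = 1, \dots, n.$$

Likewise, a state $u \in \mathbb{B}^n$ of a zero-diagonal network is said to be
$\mathcal{B}$-stable iff

$$u_i \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I > \mathcal{B}, \qquad i = 1, \dots, n.$$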
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Fig.  I.  (a)  A  symmetric  Boolean  function/of  inputs,  (b)  A  realization  of  the  symmetric 
Boolean function $f$ with a linear number (in $d$) of linear threshold elements arrayed in a depth
2 circuit.

It is easy to see that $\mathcal{B}$-stable states are fixed points of the
higher-order network with evolution under a margin $\mathcal{B}$.

The fixed points of the network take on particular significance when the
network interconnections are symmetric. In this case, under suitable
modes of operation, Liapunov functions can be shown for the system
(Hopfield, 1982; Coles and Vichniac, 1986; Maxwell et al., 1986;
Venkatesh and Baldi, 1989). In particular, each fixed point exhibits an
attraction basin; trajectories passing through states in the attraction basin
of a fixed point ultimately converge to the fixed point. This geometric picture
is particularly persuasive in associative memory applications; if, by
appropriate choice of weights, data is stored as fixed points of the network,
then the network functions as an error-correction mechanism and
identifies states sufficiently similar to a stored datum with the datum.

In this paper we do not insist on symmetry in the choice of weights. We
refer to the data to be stored as memories. By an algorithm for storing
memories we mean a prescription for generating the interaction weights of
a higher-order network of degree $d$ as a function of any given set of
memories. We investigate the maximum number of arbitrarily specified
memories that can be made fixed in the network by an algorithm; this is a
measure of the capacity of the algorithm to store data.

Let $u^1, \dots, u^m \in \mathbb{B}^n$ be an $m$-set of memories to be stored
in a higher-order network of degree $d$. We assume that the memories are chosen
randomly from the probability space of an unending series of symmetric
Bernoulli trials; specifically, the memory components, $u_i^\alpha$,
$i \in [n]$, $\alpha \in [m]$, are i.i.d. random variables with

$$P\{u_i^\alpha = -1\} = P\{u_i^\alpha = +1\} = \tfrac{1}{2}.$$

In the following we assume that the network architecture is specified to be
a higher-order network of degree $d$ operating under a margin $\mathcal{B}$.

Definition 2.4. We say that $\underline{C}_n$ is a lower capacity function (or
simply, lower capacity) for an algorithm if for every $0 < \lambda < 1$, and
$m \le (1 - \lambda)\underline{C}_n$, the probability that all the memories are
fixed points of the network generated by the algorithm tends to one as
$n \to \infty$.

Likewise, $\underline{C}_n$ is a maximal lower capacity if for every
$0 < \lambda < 1$, and $m \le (1 - \lambda)\underline{C}_n$, the probability
that there is some network in which all the memories are $\mathcal{B}$-stable
approaches one as $n \to \infty$.

Definition 2.5. We say that $\overline{C}_n$ is an upper capacity function (or
simply, upper capacity) for an algorithm if for every $0 < \lambda < 1$, and
$m \ge (1 + \lambda)\overline{C}_n$, the probability that at least one of the
memories is not a fixed point of the network generated by the algorithm tends
to one as $n \to \infty$.

Likewise, $\overline{C}_n$ is a maximal upper capacity if for every
$0 < \lambda < 1$, and $m \ge (1 + \lambda)\overline{C}_n$, the probability
that there is a network in which all the memories are $\mathcal{B}$-stable
approaches zero as $n \to \infty$.

Remarks. The first definition yields an underestimate of algorithm/
network capability, while the second definition gives an overestimate.
Note that the definitions of maximal capacity are algorithm independent,
and bound any algorithmic capacity from above. It is clear that both lower
and upper capacities always exist, and are not unique. What is more,
there does not exist a largest lower capacity or a smallest upper capacity
as the following proposition indicates. The proof is an immediate
consequence of the definitions.

Proposition 2.6. (a) If $\underline{C}_n$ is a lower capacity, then so is
$\underline{C}_n[1 \pm o(1)]$.
(b) If $\overline{C}_n$ is an upper capacity, then so is
$\overline{C}_n[1 \pm o(1)]$.

We  combine  the  lower  and  upper  estimates  of  capacity  to  obtain  the 
following: 

Definition  2.7.  C„  is  a  capacity  function  (or  simply,  capacity)  for  an 
algorithm  iff  it  is  both  a  lower  and  an  upper  capacity  for  the  algorithm;  it  is 
a  maximal  capacity  iff  it  is  both  a  maximal  lower  and  a  maximal  upper 
capacity. 

Remarks. Capacity follows a 0-1 law. The probabilistic setup we
espouse requires almost all sequences of memories within capacity to be
storable as fixed points within the network. Capacity, hence, reflects
typical behavior.² Figures 2a and 2b indicate the threshold behavior of
capacity.

Unlike  lower  and  upper  capacity  functions,  capacity  functions  are  not 
guaranteed  to  exist.  If  a  capacity  function  exists,  however,  then  it  is  not 
unique. 

Proposition 2.8. If $C_n$ is a capacity function, then so is
$C_n[1 \pm o(1)]$; conversely, if $C_n$ and $C_n'$ are two capacity functions,
then $C_n \sim C_n'$.

Proof. The first part follows trivially because $C_n$ is both a lower and an
upper capacity. To prove the converse, let $C_n$ and $C_n'$ be any two capacity
functions. Without loss of generality, let $C_n' = [1 + \alpha_n]C_n$. We must
prove that $|\alpha_n| = o(1)$.

Let $p$ denote the probability that all the memories are fixed points of the
network. Fix $\lambda, \lambda' \in (0,1)$. For
$m \le (1 - \lambda')C_n' = (1 - \lambda')(1 + \alpha_n)C_n$, we
have $p \to 1$ as $n \to \infty$. Further, for $m \ge (1 + \lambda)C_n$, we have
$p \to 0$ as $n \to \infty$. Hence, for every choice of scalars
$\lambda, \lambda' \in (0,1)$, we require that

$$(1 - \lambda')(1 + \alpha_n) < 1 + \lambda$$

for large enough $n$. It hence follows that $|\alpha_n| = o(1)$. ■


² The definitions of capacity developed in this paper subsume within them most common
notions of capacity, and can be easily extended in various ways to reflect properties of
memories other than mere stability. For other variants, cf. Cover (1965), Vapnik (1982),
Abu-Mostafa and St. Jacques (1985), Venkatesh (1986), Baldi (1988).
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Fig.  2.  (a)  Lower  and  upper  capacity  functions;  denotes  the  probability  that  each  of  m 
randomly chosen memories is a fixed point of the network. (b) The 0-1 behavior of capacity.


Thus, if capacity functions do exist, they are not very different from
each other asymptotically. Define the equivalence class $\mathcal{C}$ of
(lower/upper) capacities by
$C_n, C_n' \in \mathcal{C} \Leftrightarrow C_n \sim C_n'$. We call any member
of $\mathcal{C}$ the (lower/upper) capacity (if $\mathcal{C}$ is nonempty).


3.  The  Spectral  Algorithm 
3.1.  The  Linear  Case 

For the linear case $d = 1$, Venkatesh and Psaltis (1989a), and
Personnaz, Guyon, and Dreyfus (1985) have shown constructions which
effectively shape the spectrum of the matrix of interconnection weights to
ensure that the given set of memories is stable, while obtaining capacities
linear in $n$. The construction entails a selection of weight matrix, $W$, such
that the memories $u^\alpha$ are eigenvectors of $W$ with positive eigenvalues.
The basic notion used is that if a matrix $U$ is of full rank the orthogonal
projection of a vector $x$ into the space spanned by the columns of $U$ is
given by $U(U^T U)^{-1} U^T x$.

Let $\mathcal{B} \ge 0$ be some fixed margin of operation, and consider a fully
interconnected network of degree $d = 1$. Fix $m \le n$, and let
$\lambda^{(1)}, \dots, \lambda^{(m)} > \mathcal{B}$ be fixed (but arbitrary)
positive real numbers. Let $u^1, \dots, u^m \in \mathbb{B}^n$ be an $m$-set of
memories whose components are drawn from a sequence of symmetric Bernoulli
trials. To each memory $u^\alpha$ we associate the positive constant
$\lambda^{(\alpha)}$. Let

$$U = [u^1\ \cdots\ u^m]$$

be the $n \times m$ matrix of memories, and let $\Lambda$ be the diagonal matrix

$$\Lambda = \begin{pmatrix}
\lambda^{(1)} & 0 & \cdots & 0 \\
0 & \lambda^{(2)} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \lambda^{(m)}
\end{pmatrix}.$$

The spectral algorithm formally specifies the matrix of interaction
weights, $W = [w_{ij}]$, according to the following rule:

$$W = U \Lambda (U^T U)^{-1} U^T. \tag{4}$$

Theorem 3.1. For $d = 1$ and any choice of margin $\mathcal{B} \ge 0$, the
spectral algorithm has capacity $C_n = n$.

In fact, if the prescription (4) yields well-defined weights, then we have

$$W u^\alpha = \lambda^{(\alpha)} u^\alpha.$$

Each memory component is multiplied by a positive scalar,
$\lambda^{(\alpha)} > \mathcal{B}$, so that the memories are fixed points under
evolution according to the rule (2). As a linear transformation can have at
best $n$ eigenvectors with distinct eigenvalues, it follows that $n$ is an upper
sequence of capacities for the algorithm. The fact that $n$ is, in fact, the
capacity of the algorithm will follow if the prescription (4) is well defined
for $m \le n$ with arbitrarily high probability for large $n$. This is
established by a new result of Kahn, Komlós, and Szemerédi (1990). (This is a
refinement of the basic result proved by Komlós in 1967.)
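
The linear spectral rule (4) can be tried out numerically. The following is a minimal sketch assuming NumPy; the sizes, the eigenvalues and the retry loop that redraws the memories until they are linearly independent are illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, margin = 16, 6, 0.5

# Draw random +/-1 memories; redraw in the rare case they are dependent,
# since rule (4) needs U to have full column rank.
while True:
    U = rng.choice([-1.0, 1.0], size=(n, m))     # columns are the memories
    if np.linalg.matrix_rank(U) == m:
        break

lam = np.linspace(1.0, 2.0, m)                   # eigenvalues, all > margin
W = U @ np.diag(lam) @ np.linalg.inv(U.T @ U) @ U.T

# Each memory is an eigenvector of W with its assigned eigenvalue ...
eigen_ok = np.allclose(W @ U, U * lam)
# ... so u_i * (W u)_i = lambda > margin for every component of every memory.
stable = bool(np.all(U * (W @ U) > margin))
```

Because `W @ U` equals `U * lam` column by column, every stored memory satisfies the stability condition with room to spare whenever the eigenvalues exceed the margin.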

Proposition 3.2. Almost all $n \times n$ matrices with $\pm 1$ components
have full rank; more precisely, if the components of a random $n \times n$
matrix, $A_n$, are chosen independently and with equal probability
$\tfrac{1}{2}$ from $\pm 1$, then there is a constant $1 < b < 2$ such that the
probability that $A_n$ is nonsingular is $1 - O(b^{-n})$.

The spectral rule amounts (in synchronous operation) to iteratively
projecting states orthogonally into the linear space generated by
$u^1, \dots, u^m$, and then taking the closest point on the hypercube to this
projection. While the algorithm appears to be non-Hebbian and nonlocal,
nonetheless, a low complexity, recursive, local construction can be
shown for the algorithm using Greville's theorem; the algorithm is, hence,
attractive as an associative memory as it combines relatively low
complexity with high capacity and efficient error-correction (Venkatesh and
Psaltis, 1989a). This approach can be extended to higher-orders as we
now describe.

3.2.  Generalization  to  Higher-Order 

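Let us consider the degree of interaction $d$ to be odd for definiteness.
By combining terms we can replace the summation,
$\sum_{I \in \mathcal{I}_d} w_{(i,I)} u_I$, for each $i = 1, \dots, n$ in the
evolution rule (2) by an equivalent sum of the form

$$\sum_{\substack{k=1 \\ k\ \mathrm{odd}}}^{d}\ \sum_{1 \le i_1 < \cdots < i_k \le n} w_{i i_1 \cdots i_k}\, u_{i_1} \cdots u_{i_k}. \tag{5}$$

For $u \in \mathbb{B}^n$ to be a fixed point under evolution according to the
rule (2) it, hence, suffices that

$$u_i \times \sum_{\substack{k=1 \\ k\ \mathrm{odd}}}^{d}\ \sum_{1 \le i_1 < \cdots < i_k \le n} w_{i i_1 \cdots i_k}\, u_{i_1} \cdots u_{i_k} > \mathcal{B}, \qquad i = 1, \dots, n. \tag{6}$$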

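Now, for any $u \in \mathbb{B}^n$ let us define the $k$th generation of $u$ to
be the vector $u[k] \in \mathbb{B}^{\binom{n}{k}}$ defined by

$$u[k] = \begin{pmatrix}
u_1 u_2 \cdots u_{k-1} u_k \\
u_1 u_2 \cdots u_{k-1} u_{k+1} \\
\vdots \\
u_{n-k+1} u_{n-k+2} \cdots u_n
\end{pmatrix}; \tag{7}$$

in other words, $u[k]$ is the vector formed by lexicographically ordering the
$\binom{n}{k}$ products of components of $u$ taken $k$ at a time. We now form
the vector $\hat{u}$ from the first $\lceil d/2 \rceil$ odd generations of $u$:

$$\hat{u} = \begin{pmatrix} u[1] \\ u[3] \\ \vdots \\ u[d] \end{pmatrix}. \tag{8}$$

Now set

$$N_d = \sum_{\substack{k=1 \\ k\ \mathrm{odd}}}^{d} \binom{n}{k}.$$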
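
The generations of Eq. (7) and the stacked vector of Eq. (8) are easy to enumerate for small $n$. The sketch below assumes $d$ odd, as in the text, and the function names `generation` and `extend` are hypothetical.

```python
import itertools
import math

def generation(u, k):
    """k-th generation: products of components of u taken k at a time,
    in lexicographic order of the index sets."""
    return [math.prod(u[i] for i in idx)
            for idx in itertools.combinations(range(len(u)), k)]

def extend(u, d):
    """Stack the odd generations u[1], u[3], ..., u[d] (d assumed odd)."""
    out = []
    for k in range(1, d + 1, 2):
        out.extend(generation(u, k))
    return out

u = [1, -1, 1, -1]
u_hat = extend(u, 3)
n_d = sum(math.comb(len(u), k) for k in (1, 3))   # number of components
```

For $n = 4$ and $d = 3$ the extended vector has $\binom{4}{1} + \binom{4}{3} = 8$ components, the first four being $u$ itself.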
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Clearly,  a  is  a  binary  vector  with  Nj  components.  Let  W  denote  the  n  x 
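$N_d$ matrix of coefficients, $w_{i i_1 \cdots i_k}$, in Eq. (5) arranged
lexicographically; i.e.,

$$W = \begin{pmatrix}
w_{11} & \cdots & w_{1n} & w_{1,123} & \cdots & w_{1,n-2,n-1,n} & w_{1,12345} & \cdots & w_{1,n-d+1,n-d+2,\dots,n-1,n} \\
w_{21} & \cdots & w_{2n} & w_{2,123} & \cdots & w_{2,n-2,n-1,n} & w_{2,12345} & \cdots & w_{2,n-d+1,n-d+2,\dots,n-1,n} \\
\vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
w_{n1} & \cdots & w_{nn} & w_{n,123} & \cdots & w_{n,n-2,n-1,n} & w_{n,12345} & \cdots & w_{n,n-d+1,n-d+2,\dots,n-1,n}
\end{pmatrix}.$$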

Let $U$ be the $n \times m$ matrix of memories. Form the extended $N_d \times m$ binary matrix

$$\hat{U} = [\hat{u}^1 \cdots \hat{u}^m],$$

where $\hat{u}^\alpha \in B^{N_d}$ is as defined above. Let

$$\Lambda = \mathrm{dg}[\lambda^{(1)}, \ldots, \lambda^{(m)}]$$

be an $m \times m$ diagonal matrix with positive diagonal terms, $\lambda^{(\alpha)} > \mathcal{B}$. We formally define the generalized spectral matrix of coefficients, $W$, by

$$W = U \Lambda (\hat{U}^T \hat{U})^{-1} \hat{U}^T. \qquad (9)$$

Note that this yields stable memories as long as the matrix $\hat{U}$ is full rank. Specifically, if the initial state is one of the memories, $u^\alpha$, then we obtain

$$W \hat{u}^\alpha = \lambda^{(\alpha)} u^\alpha.$$
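The construction in Eqs. (7)-(9) can be exercised on a small example. The following is a minimal sketch, not the authors' implementation: the function names `generation`, `extend`, `solve`, and `spectral_weights` are hypothetical, the choice $\Lambda = \lambda I$ is an assumption made for simplicity, and exact rational arithmetic is used so that the identity $W \hat{u}^\alpha = \lambda u^\alpha$ can be checked without rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

def generation(u, k):
    """k-th generation of u: products of k components, in lexicographic order (Eq. (7))."""
    return [prod(u[i] for i in idx) for idx in combinations(range(len(u)), k)]

def extend(u, d):
    """Stack the odd generations u[1], u[3], ..., u[d] (d assumed odd), as in Eq. (8)."""
    return [x for k in range(1, d + 1, 2) for x in generation(u, k)]

def solve(A, B):
    """Solve A X = B exactly by Gauss-Jordan elimination over the rationals."""
    m = len(A)
    M = [[Fraction(x) for x in A[r]] + [Fraction(x) for x in B[r]] for r in range(m)]
    for c in range(m):
        p = next((r for r in range(c, m) if M[r][c] != 0), None)
        if p is None:
            raise ValueError("extended memories are linearly dependent")
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(m):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[m:] for row in M]

def spectral_weights(memories, d, lam=1):
    """W = U Lambda (Uhat^T Uhat)^{-1} Uhat^T with Lambda = lam * I (assumed choice)."""
    ext = [extend(u, d) for u in memories]      # row a is the extended memory a
    m, n = len(memories), len(memories[0])
    gram = [[sum(a * b for a, b in zip(ext[r], ext[s])) for s in range(m)]
            for r in range(m)]
    X = solve(gram, ext)                        # (Uhat^T Uhat)^{-1} Uhat^T, m x N_d
    return [[lam * sum(memories[a][i] * X[a][c] for a in range(m))
             for c in range(len(ext[0]))]
            for i in range(n)]

memories = [(1, 1, 1, 1, 1), (1, -1, 1, -1, 1), (-1, -1, 1, 1, 1)]
W = spectral_weights(memories, d=3, lam=2)
for u in memories:
    u_hat = extend(u, 3)
    # Each memory is mapped to lam times itself, so it is stable with margin < lam.
    print([int(sum(w * x for w, x in zip(row, u_hat))) for row in W])
```

With $n = 5$ and $d = 3$ the extended vectors have $N_3 = 5 + 10 = 15$ components, and the sketch raises an error if the Gram matrix $\hat{U}^T \hat{U}$ is singular, which is exactly the full-rank condition noted above.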

It is now easy to verify that Eq. (6) is satisfied for each component memory $u^\alpha$, so that $u^\alpha$ is a fixed point under evolution according to the rule (2). If the degree of interaction, $d$, is even, the exposition follows as above with the first sum in Eq. (5) being over even $k$ instead of odd $k$. The maximal allowable rate of growth of $m$ with $n$ follows immediately.

Theorem 3.3. An upper capacity of the generalized spectral algorithm of degree $d$ is

$$N_d = \sum_{k\ \mathrm{odd}}^{d} \binom{n}{k}.$$

In particular, if $d = o(n)$ then an upper capacity is $n^d/d!$.

Anecdotal evidence in implementations indicates that the above estimate of upper capacity actually holds as an estimate of capacity, as was the case for $d = 1$. There is some theoretical support for this though no complete proof. The main difficulty is that we cannot directly apply Proposition 3.2 to the matrix $\hat{U}$ as the distribution induced on vertices of $B^{N_d}$ as we build up generations according to Eqs. (7) and (8) is not uniform; indeed, we can only access $2^n$ out of the total of $2^{N_d}$ vertices. Note, however, that any two distinct vectors, $u$ and $v$, in $B^n$ when expanded to vectors $\hat{u}$ and $\hat{v}$ in $B^{N_d}$ according to Eq. (8) become more and more nearly orthogonal as the number of generations increases. In fact, let $D$ be the Hamming distance between $u$ and $v$. Then it is easily verified that the Hamming distance, $\hat{D}$, between $\hat{u}$ and $\hat{v}$ is given by†

$$\hat{D} = \sum_{j\ \mathrm{odd}}^{d}\ \sum_{k\ \mathrm{odd}} \binom{D}{k} \binom{n-D}{j-k}.$$

(If $d$ is even replace the first sum by a sum over even indices, $j = 0, 2, \ldots, d$.) As $d$ increases the vectors $\hat{u}$ and $\hat{v}$ approach orthogonality, and in fact, any pair of vectors $u$ and $v$ in $B^n$ result in orthogonal vectors $\hat{u}$ and $\hat{v}$ in $B^{2^{n-1}}$ when all odd (or even) generations are included, i.e., when $d$ is equal to $n$ or $n - 1$. To verify this note, for instance, that for any Hamming distance $0 < D < n$ between two vectors in $B^n$ the corresponding Hamming distance $\hat{D}$ between the corresponding vectors in $B^{2^{n-1}}$ when all odd generations are included is

$$\hat{D} = \sum_{j\ \mathrm{odd}}\ \sum_{k\ \mathrm{odd}} \binom{D}{k}\binom{n-D}{j-k} = \sum_{k\ \mathrm{odd}} \binom{D}{k} \sum_{j\ \mathrm{odd}} \binom{n-D}{j-k} = 2^{D-1} \cdot 2^{n-D-1} = 2^{n-2}.$$

† For simplicity we use the convention $\binom{a}{b} = 0$ if $a < b$ or $b < 0$.


Hence $\langle \hat{u}, \hat{v} \rangle = 0$ for any two distinct vectors $u \ne -v$ in $B^n$ when all odd (or even) generations are included. The preceding analysis does not work when $D = n$; i.e., we start with two opposing vertices of the $n$-cube. However, even in this case note that the generated vectors $\hat{u}$ and $\hat{v}$ become orthogonal if we include all even and odd generations. Thus, though the statistical dependence across components increases with the number of generations included, we may expect a concurrent building up of linear independence as the randomly chosen memories, $u^\alpha$, result in more and more nearly orthogonal vectors $\hat{u}^\alpha$. We may, hence, expect the nonsingularity probability estimate of Proposition 3.2 to improve for the generated matrices $\hat{U}$.* In particular, let $N_d$ denote the length of the extended vectors $\hat{u}$ for any choice of degree $d$ (which may depend on $n$).

Conjecture 3.4. If the number of memories satisfies $m \le N_d$ then the $N_d \times m$ extended matrix of memories, $\hat{U}$, is full rank with probability approaching one as $n \to \infty$.

This, in turn, would yield that the upper capacity estimate of Theorem 3.3 would actually be the estimate of the capacity of the higher-order spectral algorithm of degree $d$.
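The orthogonality argument that motivates Conjecture 3.4 can be checked by brute force for small $n$. The sketch below is only an illustration under stated assumptions: `odd_extension` and `hamming` are hypothetical helper names, and $n = 5$ is an arbitrary small choice. It expands every vertex of the $n$-cube with all odd generations and collects the Hamming distances between expansions of vectors that are neither equal nor opposite.

```python
from itertools import combinations, product
from math import prod

def odd_extension(u):
    """All odd generations of u stacked together: a vector of length 2**(n-1)."""
    n = len(u)
    return [prod(u[i] for i in idx)
            for k in range(1, n + 1, 2)
            for idx in combinations(range(n), k)]

def hamming(a, b):
    """Number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

n = 5
vertices = list(product((-1, 1), repeat=n))
ext = {u: odd_extension(u) for u in vertices}

# Distances between expansions of all pairs with Hamming distance 0 < D < n.
distances = {hamming(ext[u], ext[v])
             for u in vertices for v in vertices
             if u != v and u != tuple(-x for x in v)}
print(distances)  # prints {8}, i.e. 2**(n-2): half the length 2**(n-1)
```

A Hamming distance equal to half the vector length is the same as a zero inner product for $\pm 1$ vectors, which is the orthogonality claimed in the text.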


4.  Maximal  Capacity 

In  this  section  we  derive  the  maximal  storage  capacity  of  a  higher-order 
neural  network  of  degree  d.  The  results  are  independent  of  any  particular 
choice  of  algorithm,  and  depend  only  on  the  network  architecture — a 
higher-order  neural  network  of  degree  d.  The  maximal  capacity,  hence, 
delineates  the  upper  limit  on  storage  that  can  possibly  be  achieved  by  any 
particular choice of storage algorithm. We use a fundamental result due to Schläfli (1950) enumerating the number of linearly separable dichotomies of $m$ points in $N$-space.

Let $V = \{v^1, \ldots, v^m\} \subset R^N$ be an $m$-set of points in $N$-space.


* The estimate of Proposition 3.2 may itself be rather weak. As conjectured by Komlós, we may expect the majority of singular $\pm 1$ matrices to be singular for the trivial reason that two rows or two columns coincide. If verified, this would, of course, improve the estimate of the probability of nonsingularity in Proposition 3.2 to $1 - O(n^2 2^{-n})$.

Definition 4.1. A dichotomy $\mathcal{V} = \{V^+, V^-\}$ of $V$ is homogeneously linearly separable (hls) if there is a vector $w \in R^N$ such that the inner product

$$\langle w, v \rangle \begin{cases} > 0 & \text{if } v \in V^+ \\ < 0 & \text{if } v \in V^-. \end{cases} \qquad (10)$$

If Eq. (10) holds then $w$ is said to be a separating vector for the dichotomy.

The following version of Schläfli's counting lemma estimates the probability that a randomly chosen dichotomy is homogeneously linearly separable. We give the proof for completeness. The presentation follows that of Wendel (1962), who utilizes the result in this form in a problem in geometric probability. (See also Cover (1965) for a slightly different approach.)

Lemma 4.2. Let $V$ be an arbitrary $m$-set of points in $R^N$, and let $\mathcal{V}$ be a dichotomy of $V$ chosen independently of $V$, and with equal probability, $2^{-m}$, from the set of dichotomies of $V$. Then the probability, $P_N^m$, that $\mathcal{V}$ is homogeneously linearly separable is bounded by

$$P_N^m \le 2^{-(m-1)} \sum_{j=0}^{N-1} \binom{m-1}{j}. \qquad (11)$$

Moreover, a sufficient condition enabling us to replace the inequality above by equality is that the $m$-set of points $V$ be chosen from a joint distribution which is such that $V$ is in general position (i.e., all subsets of size $N$ are linearly independent) with probability one.

Proof. Let $D_N^m$ be the maximum number of dichotomies of an $m$-set of points in $R^N$ that are hls. Then

$$P_N^m \le 2^{-m} D_N^m.$$

In order to demonstrate the validity of Eq. (11) it suffices, hence, to show that

$$D_N^m = 2 \sum_{j=0}^{N-1} \binom{m-1}{j}. \qquad (12)$$

Let $V$ denote an $m$-set of points for which $D_N^m$ dichotomies are hls. (Such a set exists as $D_N^m \le 2^m$ is finite.) Let $V^\alpha$ be the hyperplane orthogonal to $v^\alpha$. Then $D_N^m$ is the number of path-components in $R^N \setminus \bigcup_{\alpha=1}^{m} V^\alpha$ as each path-component is a maximally connected set of vectors all homogeneously separating the same dichotomy of $V$.

Now consider the effect of deleting the hyperplane $V^m$. The remaining $m - 1$ hyperplanes determine $D_N^{m-1}$ path-components. These are of two types: (i) those path-components (say $Q_1$ in number) which have a nonnull intersection with the hyperplane $V^m$, and (ii) those path-components (say $Q_2$ in number) which do not intersect $V^m$. Clearly then, $D_N^{m-1} = Q_1 + Q_2$. With $V^m$ restored it cuts each path-component of type (i) in two, and leaves path-components of type (ii) undisturbed. Hence

$$D_N^m = 2Q_1 + Q_2 = D_N^{m-1} + Q_1.$$

Now the intersection of the $Q_1$ type (i) components with the hyperplane $V^m$ generates $Q_1$ path-components in $V^m \setminus \bigcup_{\alpha=1}^{m-1} (V^m \cap V^\alpha)$. As the sets $V^m \cap V^\alpha$ are just the hyperplanes in the $(N - 1)$-dimensional space $V^m$ orthogonal to the projection of the vectors $v^\alpha$ into $V^m$, it follows that $Q_1 = D_{N-1}^{m-1}$. Hence

$$D_N^m = D_N^{m-1} + D_{N-1}^{m-1}.$$

This recursion with the obvious boundary conditions

$$D_N^1 = D_1^m = 2$$

yields the solution (12) which can be readily verified by induction.

To complete the proof we need to show that we can replace the inequality in Eq. (11) by equality if the $m$-set of points $V$ is in general position with probability one. This follows immediately, however, from the simple observation that the proof above continues to work to estimate the number of hls dichotomies of any $m$-set of points which has an attribute which is preserved under projections. ■
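The induction step in the proof of Lemma 4.2 can also be checked numerically. The sketch below is an illustration only, with hypothetical function names `d_rec` and `d_closed`: it evaluates the recursion $D_N^m = D_N^{m-1} + D_{N-1}^{m-1}$ with boundary values $D_N^1 = D_1^m = 2$ and compares it with the closed form of Eq. (12).

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def d_rec(m, N):
    """Number of hls dichotomies via the recursion, with D(1, N) = D(m, 1) = 2."""
    if m == 1 or N == 1:
        return 2
    return d_rec(m - 1, N) + d_rec(m - 1, N - 1)

def d_closed(m, N):
    """Closed form of Eq. (12): 2 * sum_{j=0}^{N-1} C(m-1, j)."""
    return 2 * sum(comb(m - 1, j) for j in range(N))

# The two agree on a grid of small values.
agree = all(d_rec(m, N) == d_closed(m, N)
            for m in range(1, 15) for N in range(1, 15))
print(agree)  # prints True

# For m <= N every dichotomy is counted, and at m = 2N exactly half are.
print(d_closed(6, 8) == 2 ** 6, 2 * d_closed(16, 8) == 2 ** 16)  # prints True True
```

The second check reflects the threshold behaviour that drives the capacity results of this section: the fraction $2^{-m} D_N^m$ equals one for $m \le N$ and one half at $m = 2N$.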

We require the following technical result due to Chernoff (1952) which gives bounds for very large deviations in the tails of the binomial distribution.

Lemma 4.3. Fix $\tfrac12 < c < 1$ and let $H$ denote the entropy function

$$H(x) = -x \log_2 x - (1 - x)\log_2(1 - x) \qquad (0 < x < 1).$$

Let $p$ denote the probability that in $M$ trials of a fair coin the number of successes is greater than or equal to $cM$. Then

$$p = 2^{-M} \sum_{j \ge cM} \binom{M}{j} \le 2^{-[1 - H(c)]M}.$$

Theorem 4.4. $C_n = 2\binom{n-1}{d}$ is a maximal upper capacity for zero-diagonal neural networks of degree $d$.

Proof. Let $U = \{u^1, \ldots, u^m\}$ be a randomly specified $m$-set of memories whose components are generated from a sequence of symmetric Bernoulli trials. If each of the memories is to be $\mathcal{B}$-stable we require to find real coefficients, $w_{(i,I)}$, $I \in \mathcal{I}_d$, $i \notin I$, such that for each $i \in [n]$, and $\alpha \in [m]$,

$$u_i^\alpha \sum_{I \in \mathcal{I}_d:\, i \notin I} w_{(i,I)} u_I^\alpha > \mathcal{B}. \qquad (13)$$

We first argue that without loss of generality we can restrict attention to a margin $\mathcal{B} = 0$. In fact, if there exists a choice of coefficients, $w_{(i,I)}$, such that

$$u_i^\alpha \sum_{I \in \mathcal{I}_d:\, i \notin I} w_{(i,I)} u_I^\alpha > 0, \qquad i = 1, \ldots, n, \quad \alpha = 1, \ldots, m,$$

then, if $\tau > 0$ is the smallest of the sums above, the simple expedient of scaling all coefficients by a positive scalar greater than $\mathcal{B}/\tau$ will result in Eq. (13) being automatically satisfied.

Referring to the evolution rule (3) (with margin $\mathcal{B} = 0$) we see that each higher-order neuron in a zero-diagonal network of degree $d$ realizes a separating plane in $\binom{n-1}{d}$-space. For the memories to be fixed points we hence are required for each $i = 1, \ldots, n$ to find $N = \binom{n-1}{d}$ real coefficients $w_{(i,I)}$, $I \in \mathcal{I}_d$, $i \notin I$, such that

$$u_i^\alpha = \mathrm{sgn}\Big(\sum_{I \in \mathcal{I}_d:\, i \notin I} w_{(i,I)} u_I^\alpha\Big), \qquad \alpha = 1, \ldots, m. \qquad (14)$$

Now fix $i$ and let $\mathcal{E}_i$ be the event that there is no weight vector $w_i = [w_{(i,I)}]$ in $N$-space which separates the dichotomy of the extended $m$-set of memories, $[u_I^\alpha]$, with components varying over the set of indices $I \in \mathcal{I}_d$, $i \notin I$, and $\alpha = 1, \ldots, m$, induced by Eq. (14), i.e., the partition of the memories according to whether $u_i^\alpha$ is $-1$ or $+1$. Note that the term $u_i^\alpha$ does not appear anywhere in the sum or in the right-hand side of Eq. (14). As the components $u_i^\alpha$ are drawn from symmetric Bernoulli trials it follows that the dichotomy indicated in Eq. (14) is chosen independently of the extended $m$-set of memories. By Lemma 4.2 we hence have

$$P\{\mathcal{E}_i\} = 1 - P_N^m \ge 1 - 2^{-(m-1)} \sum_{j=0}^{N-1} \binom{m-1}{j}. \qquad (15)$$

Let $\mathcal{P}$ be the probability that there exists a zero-diagonal network of degree $d$ in which the fundamental memories are stable. Then

$$\mathcal{P} = 1 - P\Big\{\bigcup_{i=1}^{n} \mathcal{E}_i\Big\} \le 1 - P\{\mathcal{E}_i\}. \qquad (16)$$

Set $M = m - 1$ for notational convenience. Using Eq. (15) with the upper bound for $\mathcal{P}$ in Eq. (16) we have

$$\mathcal{P} \le 2^{-M} \sum_{j=0}^{N-1} \binom{M}{j}.$$

Fix $\lambda > 0$ and choose $M = \lceil 2N(1 + \lambda) \rceil$. Then $N = c_1 M$ where $0 < c_1 < \tfrac12$. Using Lemma 4.3 (applied to the complementary tail, by the symmetry of the binomial coefficients) we hence have

$$\mathcal{P} \le 2^{-M} \sum_{j=0}^{N-1} \binom{M}{j} \to 0 \qquad (n \to \infty).$$

Hence $2N + 1$ is a maximal upper capacity, and by Proposition 2.6 so is $2N = 2\binom{n-1}{d}$. ■
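To see how quickly the bound in the proof of Theorem 4.4 decays, it can be evaluated directly. The following sketch is illustrative only: `stability_bounds` and `entropy` are hypothetical names, and the values $d = 2$, $\lambda = 1/2$ and $n \in \{6, 8, 10\}$ are arbitrary choices. It computes the exact tail $2^{-M} \sum_{j=0}^{N-1} \binom{M}{j}$ and the Chernoff estimate of Lemma 4.3 applied to the mirrored tail with $c = (M - N + 1)/M$.

```python
from fractions import Fraction
from math import ceil, comb, log2

def entropy(x):
    """Binary entropy function H(x) in bits, for 0 < x < 1."""
    return -x * log2(x) - (1 - x) * log2(1 - x)

def stability_bounds(n, d, lam):
    """Return (N, M, exact tail bound, Chernoff bound) for M = ceil(2N(1 + lam))."""
    N = comb(n - 1, d)
    M = ceil(2 * N * (1 + lam))
    # Exact value of 2^{-M} * sum_{j=0}^{N-1} C(M, j).
    exact = Fraction(sum(comb(M, j) for j in range(N)), 2 ** M)
    # By symmetry this is the upper tail from M - N + 1 onwards, so c = (M-N+1)/M.
    c = (M - N + 1) / M
    chernoff = 2.0 ** (-(1 - entropy(c)) * M)
    return N, M, exact, chernoff

for n in (6, 8, 10):
    N, M, exact, chernoff = stability_bounds(n, d=2, lam=Fraction(1, 2))
    print(n, N, M, float(exact), chernoff)
```

Both columns shrink as $n$ grows, with the exact tail always below the Chernoff estimate, in line with the limiting statement used to close the proof.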

A maximal lower capacity of $N$ is readily demonstrated if an independence conjecture similar to the one earlier holds. Fix any index $i$ in $[n]$, and consider an extended set of $N$ memories $[u_I^\alpha]$, $I \in \mathcal{I}_d$, $i \notin I$, where each extended memory is a binary ($\pm 1$) vector of length $N$. Denote this set of (extended) memories by $\hat{U}$.

Conjecture 4.5. The set of extended memories $\hat{U}$ is linearly independent with probability approaching one as $n \to \infty$.

For a choice of $m \le \binom{n-1}{d}$, $P_N^m = 1$ for almost all choices of $m$ memories by Lemma 4.2 if the above holds. This will yield a lower maximal capacity of $N = \binom{n-1}{d}$. We can, however, hope for more: the following application of a result of Füredi (1986) provides a lower bound for the probability that a dichotomy of a randomly chosen $m$-set from the vertices of an $N$-cube is hls.

Lemma 4.6. Let an $m$-set of points be chosen independently from the uniform distribution over the vertices of the binary $N$-cube. Then, if $m \le 2N$, the probability that an arbitrary dichotomy of the $m$-set of points is hls is bounded below by

$$P_N^m = 2^{-(m-1)} \sum_{j=0}^{N-1} \binom{m-1}{j} - O(b^{-N}) \qquad (N \to \infty),$$

where $b > 1$ is a fixed constant.

Remarks. The exponentially small order term quoted above is a refinement of Füredi's original estimate of $O(N^{-1/2})$ using Proposition 3.2. The result eschews the general position requirement of Lemma 4.2. Specifically, the upper bound for $P_N^m$ in Eq. (11) is sharp if $m \le 2N$ and the $m$-set of points is chosen independently from the uniform distribution on vertices of the $N$-cube.

Füredi's result makes it appear likely that, in fact, $2N = 2\binom{n-1}{d}$ is the maximal capacity of a zero-diagonal higher-order network of degree $d$. We again have a situation as in the previous section where we would like to apply the result not to the uniform distribution, but to the distribution corresponding to the $d$th generation of an $m$-set generated randomly from the uniform distribution on $B^n$. If the above lemma continues to hold for this situation, then for $m \le 2N$ we can replace the estimate (15) in the proof of the theorem above by

$$P\{\mathcal{E}_i\} = 1 - 2^{-M} \sum_{j=0}^{N-1} \binom{M}{j} + O(b^{-N}),$$

where, again, we set $M = m - 1$. Using the union bound we have from Eq. (16) that

$$1 - n P\{\mathcal{E}_i\} \le \mathcal{P}.$$

Fix $0 < \lambda < \tfrac12$ and choose $M = \lfloor 2N(1 - \lambda) \rfloor$. Under the above assumption we then have for $d = o(n)$ that

$$\mathcal{P} \ge 1 - n\Big[1 - 2^{-M} \sum_{j=0}^{N-1} \binom{M}{j} + O(b^{-N})\Big] = 1 - n\Big[2^{-M} \sum_{j \ge c_2 M} \binom{M}{j} + O(b^{-N})\Big],$$

where $\tfrac12 < c_2 < 1$. As $N = c_2 M$ and $N \sim n^d/d!$ we then have by Chernoff's large deviation bound (Lemma 4.3) that for a choice of constant $c_3 > 0$

$$\mathcal{P} \ge 1 - O\big(n 2^{-c_3 n^d/d!}\big) - O\big(n b^{-n^d/d!}\big) \to 1 \qquad (n \to \infty).$$

So $2N + 1$ is a lower sequence of maximal capacities, and hence, so is $2N$ by Proposition 2.6 if Füredi's result holds in this case.

For the case $d = 1$ it is clear that Füredi's lemma holds in toto so that the above analysis works with $N = n - 1$. For the case of linear interactions, hence, we have shown the following.

Theorem 4.7. The sequence $2n$ is the maximal capacity for zero-diagonal neural networks with linear interactions, $d = 1$.

Remark. It is known that $2n$ is the capacity of a single linear threshold element [cf., for instance, Cover, 1965; Venkatesh and Psaltis, 1991]. The above result asserts that there is no decrease in capacity for the zero-diagonal network of $n$ neurons even though we now have a situation where $n$ neurons operate on the same set of memories.


5.  Concluding  Observations 

1. For the case $d = 1$ Abu-Mostafa and St. Jacques (1985) demonstrate that with the requirement that all choices of $m$ vectors be stored as fixed points for some choice of zero-diagonal network, $m$ can be no larger than $n$. However, small pathological sets of vectors which cannot be stored can be found (Montgomery and Vijayakumar, 1986), and such pathologies make it difficult to achieve nontrivial deterministic capacities. The probabilistic setup adopted here essentially relaxes the requirement that all choices of $m$ vectors be storable to the requirement that almost all choices of $m$ memories be storable; pathological scenarios that cannot be stored form a set whose size is small compared to $\binom{2^n}{m}$, and are effectively ignored in this definition.

2. The maximal capacities for nonzero diagonal networks are of the same order as those for the zero-diagonal networks. Note, however, that we are required to put restrictions on the allowable choices of interactions. Specifically, consider the case $d = 1$. With a choice of identity matrix of interactions, $w_{ij} = \delta_{ij}$, it is clear that all states in $B^n$ are stable with the same margin of stability. There is clearly no associative storage possible in this situation. To avoid situations of this type we have to put constraints on the allowable interactions so that the number of extraneous stable states does not become too large: specifically, the diagonal terms should not dominate the nondiagonal terms. Similar examples hold for the higher-order cases.

3. The capacity estimates continue to hold if we are required to store random associations of the form $u^\alpha \mapsto v^\alpha$. We then call the vectors $v^\alpha$ the associated memories. The spectral algorithm generalizes in a straightforward manner with the interaction matrix of coefficients of Eq. (9) modified to

$$W = V \Lambda (\hat{U}^T \hat{U})^{-1} \hat{U}^T$$

with $V$ being the $n \times m$ matrix of associated memories.

4. The main unresolved issue in this work is the conjecture introduced in this paper that the linear independence property is preserved (strengthened!) when we consider higher generations of vectors chosen uniformly from $B^n$. This is independent of the Komlós conjecture.

Recent results on the memory storage capacity of the outer-product algorithm indicate that the algorithm stores of the order of $n/\log n$ memories in a network of $n$ fully interconnected linear threshold elements when it is required that each memory be exactly recovered from a probe which is close enough to it. In this paper a rigorous analysis is presented of generalizations of the outer-product algorithm to higher-order networks of densely interconnected polynomial threshold units of degree $d$. Precise notions of memory storage capacity are formulated, and it is demonstrated that both static and dynamic storage capacities of all variants of the outer-product algorithm of degree $d$ are of the order of $n^d/\log n$,

1. Introduction

1.1. Overview

Formal neural network models of densely interconnected linear threshold gates have found considerable recent application in a variety of problems such as associative memory, error correction, and optimization. In these networks the model neurons are linear threshold elements with $n$ real inputs and a single binary output. Each neuron is characterized by $n$ real weights, say $w_{i1}, \ldots, w_{in}$, and a real threshold (which we assume to be zero for simplicity). Given inputs $u_1, \ldots, u_n$ the $i$th neuron produces an output $v_i \in \{-1, 1\}$ which is simply the sign of the weighted sum of inputs:

$$v_i = \mathrm{sgn}\Big(\sum_{j=1}^{n} w_{ij} u_j\Big). \qquad (1)$$

A fully interconnected network of $n$ formal neurons is then completely characterized by an $n \times n$ matrix of real weights.

* Presented in part at the IEEE Conference on Neural Information Processing Systems, Denver, Colorado, November 1987, and at the IEEE International Symposium on Information Theory, Kobe, Japan, June 1988.

† Corresponding author.

‡ Also the Division of Biology, California Institute of Technology, Pasadena, CA 91125.

A number of authors have recently begun to investigate more general networks obtained by incorporating polynomial instead of linear interactions between the threshold processing elements. Specifically, the linear threshold elements of Eq. (1) are replaced by polynomial threshold elements of given degree $d$: the output, $v_{i_1}$, of the $i_1$th higher-order neuron in response to inputs $u_1, \ldots, u_n$ is given by the sign of an algebraic form

$$v_{i_1} = \mathrm{sgn}\Big(\sum_{i_2, \ldots, i_{d+1}} w_{i_1 i_2 \cdots i_{d+1}}\, u_{i_2} \cdots u_{i_{d+1}}\Big). \qquad (2)$$

The number of interaction coefficients is increased to $n^{d+1}$ from the $n^2$ weights for the case of linear interactions. The added degrees of freedom in the interaction coefficients can potentially result in enhanced flexibility and programming capability over the linear case; in general, the computational gains match the added degrees of freedom (Venkatesh and Baldi, 1991).¹

In this paper we estimate the maximum number of arbitrarily specified vectors (memories) that can be reliably stored by the outer-product algorithm in a higher-order network of degree $d$. We estimate both static capacities, where we require the memories to be stored as fixed points of the network, and dynamic capacities, where the specified memories are required to be attractors as well. Our principal results are as follows:

The static and dynamic storage capacities of all variants of the outer-product algorithm generalized to degree $d$ are of the order of $n^d/\log n$ memories.

The maximal storage capacities that can be realized in a higher-order network of degree $d$ are of the order of $n^d$ (Venkatesh and Baldi, 1991), so that the outer-product prescription for storing memories loses a logarithmic factor in capacity. This, however, is somewhat offset by the ease of programmability and the simplicity of the algorithm.

¹ Higher-order neural networks with random interactions lead to rather different computational issues. We deal with these in a concurrent paper (Venkatesh and Baldi, 1989a).

Notation. We utilize standard asymptotic notation and introduce two (nonstandard) notations. Let $\{x_n\}$ and $\{y_n\}$ be positive sequences. We denote:

1. $x_n = \Omega(y_n)$ if there is a positive constant $K$ such that $x_n/y_n \ge K$ for all $n$;

2. $x_n = O(y_n)$ if there exists a positive constant $L$ such that $x_n/y_n \le L$ for all $n$;

3. $x_n = \Theta(y_n)$ if $x_n = O(y_n)$ and $x_n = \Omega(y_n)$;

4. $x_n \sim y_n$ if $x_n/y_n \to 1$ as $n \to \infty$; we also use $x_n \lesssim y_n$ if $x_n/y_n \le 1$ for $n$ large enough, and $x_n \gtrsim y_n$ if $x_n/y_n \ge 1$ for $n$ large enough;

5. $x_n = o(y_n)$ if $x_n/y_n \to 0$ as $n \to \infty$.

We also say that a positive sequence, $\{M_n\}$, is polynomially increasing if $\log M_n = \Theta(\log n)$ for any fixed base of logarithm. (All logarithms in the exposition are to the base $e$.) We denote by $B$ the set $\{-1, 1\}$, and by $[n]$ the set $\{1, 2, \ldots, n\}$. Finally, by an ordered multiset we mean an ordered collection of elements where repetition is allowed.

Organization.  The  basic  definitions  were  set  up  in  a  preceding  paper 
(Venkatesh  and  Baldi,  1991),  and  we  briefly  summarize  them  in  the  rest  of 
this  section.  In  Section  2  we  describe  the  generalization  of  the  outer- 
product  algorithm  to  higher-order  networks.  In  Section  3  we  present  the 
main  theorem  on  the  static  storage  capacity  of  the  outer-product  algo¬ 
rithm.  In  Section  4  we  prove  the  theorem  for  the  simplest  case  of  first- 
order  interactions  where  the  neurons  are  linear  threshold  elements;  the 
proof  techniques  used  here  are  somewhat  simpler  than  those  for  the  gen¬ 
eral  case.  In  Section  5  we  prove  the  main  theorem  on  the  static  capacity  of 
the  higher-order  outer-product  algorithm.  Following  the  proof  of  the  main 
theorem,  in  Section  6  we  then  infer  similar  static  capacity  results  for  the 
outer-product  algorithm  when  self-interconnections  are  proscribed — the 
zero-diagonal  case.  In  Section  7  we  consider  the  dynamic  case.  Theorems 
are  proved  in  the  body  of  the  paper,  while  technical  results  needed  in  the 
proofs  are  confined  to  the  Appendix. 


1.2.  Higher-Order  Neural  Networks 

We consider recurrent networks of polynomial threshold units each of which yields an instantaneous state of $-1$ or $+1$. More formally, for positive integers $n$ and $d$, let $\mathcal{S}_d$ be the set of ordered multisets of cardinality $d$ of the set $[n]$. Clearly $|\mathcal{S}_d| = n^d$. For any subset $I = (i_1, i_2, \ldots, i_d) \in \mathcal{S}_d$, and for every $u = (u_1, u_2, \ldots, u_n) \in B^n$, set $u_I = \prod_{j=1}^{d} u_{i_j}$.

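Definition 1.1. A higher-order neural network of degree $d$ is characterized by a set of $n^{d+1}$ real weights $w_{(i,I)}$ indexed by the ordered pair $(i, I)$ with $i \in [n]$ and $I \in \mathcal{S}_d$, and a real margin of operation $\mathcal{B} \ge 0$. The network dynamics are described by trajectories in a state space of binary $n$-tuples, $B^n$: for any state $u \in B^n$ on a trajectory, a component update $u_i \mapsto u_i'$ is permissible iff

$$u_i' = \begin{cases} -1 & \text{if } \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I < -\mathcal{B} \\ -u_i & \text{if } -\mathcal{B} \le \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I \le \mathcal{B} \\ 1 & \text{if } \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I > \mathcal{B}. \end{cases} \qquad (3)$$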


The evolution may be synchronous with all components of $u$ being updated according to the rule (3) at each epoch, or asynchronous with at most one component being updated per epoch according to Eq. (3).

The network is said to be symmetric if $w_{(i,I)} = w_{(j,J)}$ whenever the $(d + 1)$-tuples of indices $(i, I)$ and $(j, J)$ are permutations of each other. The network is said to be zero-diagonal if $w_{(i,I)} = 0$ whenever any index repeats in $(i, I)$.

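Let $\mathcal{I}_d$ denote the set of subsets of $d$ elements from $[n]$; $|\mathcal{I}_d| = \binom{n}{d}$.

Combining all redundant terms in Eq. (3), for symmetric, zero-diagonal networks a component update $u_i \mapsto u_i'$ is permissible iff

$$u_i' = \begin{cases} -1 & \text{if } \sum_{I \in \mathcal{I}_d:\, i \notin I} w_{(i,I)} u_I < -\mathcal{B} \\ -u_i & \text{if } -\mathcal{B} \le \sum_{I \in \mathcal{I}_d:\, i \notin I} w_{(i,I)} u_I \le \mathcal{B} \\ 1 & \text{if } \sum_{I \in \mathcal{I}_d:\, i \notin I} w_{(i,I)} u_I > \mathcal{B}. \end{cases} \qquad (4)$$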

As in the case of recurrent networks of linear threshold units, the dynamics of recurrent higher-order networks can be described by Lyapunov functions (Hopfield, 1982; Goles and Vichniac, 1986; Maxwell et al., 1986; Psaltis and Park, 1988; Venkatesh and Baldi, 1989a) under suitable conditions on the interaction weights. Consider, in particular, a symmetric, zero-diagonal network of degree $d$. For $u \in B^n$ define the algebraic Hamiltonian of degree $d$ by

$$H_d(u) = -\sum_{I \in \mathcal{I}_{d+1}} w_I u_I.$$

We then have the following assertion which we give without proof.

Proposition 1.2. The function $H_d$ is nonincreasing under the evolution rule (4) in asynchronous operation.

In  light  of  such  results  we  are  interested  in  the  number  of  fixed  points  of 
the  network  and,  in  the  associative  memory  application,  in  the  trajecto¬ 
ries  leading  into  the  fixed  points. 
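The behaviour asserted in Proposition 1.2 can be observed in a small simulation. This is a minimal sketch under explicit assumptions, not the paper's construction: the margin is taken to be zero, the weights are random integers indexed by $(d+1)$-subsets (one weight $w_J$ per subset, as symmetry and the zero-diagonal condition allow), and the names `field`, `hamiltonian`, `update`, and `run` are hypothetical.

```python
import random
from itertools import combinations
from math import prod

def field(w, u, i):
    """Sum over (d+1)-subsets J containing i of w_J times the other d components."""
    return sum(wJ * prod(u[j] for j in J if j != i)
               for J, wJ in w.items() if i in J)

def hamiltonian(w, u):
    """Algebraic Hamiltonian: minus the sum over (d+1)-subsets J of w_J * u_J."""
    return -sum(wJ * prod(u[j] for j in J) for J, wJ in w.items())

def update(w, u, i):
    """Rule (4) with margin 0 (an assumption): take the sign, flip on a zero sum."""
    h = field(w, u, i)
    new = 1 if h > 0 else -1 if h < 0 else -u[i]
    return u[:i] + (new,) + u[i + 1:]

def run(n=6, d=2, sweeps=5, seed=0):
    """Asynchronous sweeps over the components; returns the energy after each update."""
    rng = random.Random(seed)
    w = {J: rng.randint(-3, 3) for J in combinations(range(n), d + 1)}
    u = tuple(rng.choice((-1, 1)) for _ in range(n))
    energies = [hamiltonian(w, u)]
    for _ in range(sweeps):
        for i in range(n):
            u = update(w, u, i)
            energies.append(hamiltonian(w, u))
    return energies

energies = run()
print(all(a >= b for a, b in zip(energies, energies[1:])))  # prints True
```

The check succeeds for any weights because the Hamiltonian depends on $u_i$ only through $-u_i$ times the sum seen by neuron $i$, so aligning $u_i$ with the sign of that sum cannot raise the energy, and a flip on a zero sum leaves it unchanged.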

Definition 1.3. Let $\mathcal{B} \ge 0$ be fixed. A state $u \in B^n$ of a higher-order neural network of degree $d$ is said to be $\mathcal{B}$-stable iff

$$u_i \sum_{I \in \mathcal{S}_d} w_{(i,I)} u_I > \mathcal{B}, \qquad i = 1, \ldots, n.$$

Likewise, a state $u \in B^n$ of a zero-diagonal network is said to be $\mathcal{B}$-stable iff

$$u_i \sum_{I \in \mathcal{I}_d:\, i \notin I} w_{(i,I)} u_I > \mathcal{B}, \qquad i = 1, \ldots, n.$$

It is easy to see that $\mathcal{B}$-stable states are fixed points of the higher-order network with evolution under a margin $\mathcal{B}$. The notion of $\mathcal{B}$-stable states is explored further in Komlós and Paturi (1988) and Venkatesh and Baldi (1989a).
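The stability condition of Definition 1.3 translates directly into a check over the components. The sketch below is illustrative: `is_stable` is a hypothetical name, weights are stored in a dictionary keyed by the pair $(i, I)$ with $I$ a $d$-tuple of indices, and the example network (degree $d = 1$ with identity interactions) is an assumed toy case chosen because every state is then stable with the same margin.

```python
from itertools import product
from math import prod

def is_stable(w, u, margin):
    """True iff u_i * sum_I w_(i,I) u_I > margin for every component i."""
    n = len(u)
    sums = [0] * n
    for (i, I), weight in w.items():
        sums[i] += weight * prod(u[j] for j in I)
    return all(u[i] * sums[i] > margin for i in range(n))

n = 4
identity = {(i, (i,)): 1 for i in range(n)}   # d = 1, w_ij = 1 if i == j else 0

# Every state has u_i * u_i = 1, so all 2**n states are stable for any margin below 1.
all_stable = all(is_stable(identity, u, 0.5) for u in product((-1, 1), repeat=n))
print(all_stable)                              # prints True
print(is_stable(identity, (1, 1, 1, 1), 1))    # prints False: the inequality is strict
```

A network in which every state is stable stores nothing selectively, which is why stability alone is only a necessary ingredient of the capacity definitions that follow.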

We  refer  to  the  data  to  be  stored  as  memories.  By  an  algorithm  for 
storing  memories  we  mean  a  prescription  for  generating  the  interaction 
weights  of  a  higher-order  network  of  degree  d  as  a  function  of  any  given 
set  of  memories.  We  will  investigate  the  maximum  number  of  arbitrarily 
specified  memories  that  can  be  made  fixed  in  the  network  by  an  algorithm; 
this  is  a  measure  of  the  capacity  of  the  algorithm  to  store  data. 

1.3.  Memory  Storage  Capacity 

Let $u^1, \ldots, u^m \in B^n$ be an $m$-set of memories to be stored in a higher-order network of degree $d$. We assume that the memories are chosen randomly from the probability space of an unending series of symmetric Bernoulli trials: specifically, the memory components, $u_i^\alpha$, $i \in [n]$, $\alpha \in [m]$, are i.i.d. random variables with

$$P\{u_i^\alpha = -1\} = P\{u_i^\alpha = +1\} = \tfrac12.$$

In  the  following  we  assume  that  the  network  architecture  is  specified  to  be 
a  higher-order  network  of  degree  d. 

Definition 1.4. We say that $C_n$ is a capacity function (or simply, capacity) for an algorithm iff, for every choice of $\delta > 0$, the following two conditions hold as $n \to \infty$:

(a) The probability that all the memories are fixed points of the network generated by the algorithm tends to one whenever $m \le (1 - \delta)C_n$;

(b) The probability that at least one of the memories is not a fixed point of the network generated by the algorithm tends to one whenever $m \ge (1 + \delta)C_n$.

If a sequence satisfies condition (a) we call it a lower capacity function and denote it by $\underline{C}_n$. Likewise, if a sequence satisfies condition (b) we call it an upper capacity function and denote it by $\overline{C}_n$.

Thus, if a capacity function exists for an algorithm, then it is both a lower and an upper capacity function for the algorithm. Define an equivalence class $\mathcal{C}$ of (lower/upper) capacity functions by $C_n, C_n' \in \mathcal{C} \Leftrightarrow C_n \sim C_n'$. We call any member of $\mathcal{C}$ the (lower/upper) capacity (if $\mathcal{C}$ is nonempty). Note that the definitions ensure that if any capacity function exists then the equivalence class of capacity functions is uniquely defined (Venkatesh and Baldi, 1991). (This is not true, however, for lower and upper capacities which are always guaranteed to exist.)

The  above  definitions  of  capacity  require  that  all  the  memories  are  fixed 
points  with  probability  approaching  one.  We  obtain  weaker  definitions  of 
capacity  if  we  require  just  that  most  of  the  memories  be  fixed  points. 

Definition 1.5. We say that $C_n^*$ is a weak capacity function (or simply, weak capacity) for an algorithm iff, for every choice of $\delta > 0$, the following two conditions hold as $n \to \infty$:

(a) The expected number of memories that are fixed points is $m(1 - o(1))$ whenever $m \le (1 - \delta)C_n^*$;

(b) The expected number of memories that are fixed points is $o(m)$ whenever $m \ge (1 + \delta)C_n^*$.

If a sequence satisfies condition (a) we call it a weak lower capacity function and denote it by $\underline{C}_n^*$. Likewise, if a sequence satisfies condition (b) we call it a weak upper capacity function and denote it by $\overline{C}_n^*$.

We again define an equivalence class $\mathcal{C}^*$ of (lower/upper) capacity functions by $C_n^*, C_n^{*\prime} \in \mathcal{C}^* \Leftrightarrow C_n^* \sim C_n^{*\prime}$. We call any member of $\mathcal{C}^*$ the weak (lower/upper) capacity (if $\mathcal{C}^*$ is nonempty).

For  the  network  to  function  as  an  associative  memory  we  require  that  it 
corrects  for  errors  in  inputs  sufficiently  close  to  the  stored  memories. 

Definition 1.6. For a given mode of operation (synchronous or asynchronous) and a chosen time scale of operation (synchronous one-step, synchronous multiple-step, or asynchronous multiple-step) we say that a memory is a $\rho$-attractor for a choice of parameter $0 \le \rho < \tfrac{1}{2}$ iff a randomly chosen state in the Hamming ball of radius $\rho n$ at the memory is mapped into the memory, within the given time scale, and for the given mode of operation, with probability approaching one as $n \to \infty$.²

In a manner completely analogous to the definitions of capacity above, we can now define $\rho$-attractor capacities and weak $\rho$-attractor capacities for the given mode of operation and the given time scale of operation by replacing the requirement of stable memories by the requirement that the memories be $\rho$-attractors.


2.  The  Outer-Product  Algorithm 
2.1.  The  Classical  Hebb  Rule 

The outer-product algorithm (a special case of what is known as the Hebb rule) has been proposed by several authors as appropriate in a model of physical associative memory. While the algorithm is of some antiquity, formal analyses of its performance have become available only recently (cf. McEliece et al., 1987; Newman, 1988; and Komlós and Paturi, 1988). (For related nonrigorous results based upon replica calculations and statistical physics see, for instance, Amit et al., 1985, and Peretto and Niez, 1986.)

Let $u^1, \ldots, u^m \in \mathbb{B}^n$ be an m-set of memories. We will assume that the components, $u_i^\alpha$, $i = 1, \ldots, n$, $\alpha = 1, \ldots, m$, are drawn from a sequence of symmetric Bernoulli trials. For the linear case $d = 1$ the outer-product algorithm prescribes the interaction weights, $w_{ij}$, according to the rule

$$w_{ij} = \sum_{\alpha=1}^{m} u_i^\alpha u_j^\alpha - g\,m\,\delta_{ij}, \qquad i, j = 1, \ldots, n,$$

where g is a parameter with $0 \le g \le 1$, and $\delta_{ij}$ is the Kronecker delta.

It can be easily seen that in this algorithm the memories are stable with high probability provided m is small compared to n; further, the construction utilizing outer-products of the memories results in a symmetric interaction matrix which in turn ensures that stable memories are attractors. The algorithm hence functions as a viable associative memory. McEliece et al. (1987) (cf. also Komlós and Paturi, 1988) carried out precise analytical calculations of the storage capacity of the outer-product algorithm under a variety of circumstances and showed that the capacity of the outer-product algorithm is of the order of $n/\log n$.³

² For linear interactions, $d = 1$, Komlós and Paturi (1988) have investigated the more stringent case where they require the entire Hamming ball of radius $\rho n$ around a memory to be attracted to the memory.
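The linear rule above can be tried out numerically. The following is a minimal sketch, not the authors' code: it assumes the rule has the form $w_{ij} = \sum_\alpha u_i^\alpha u_j^\alpha - g m \delta_{ij}$, uses a zero-margin sign update, and picks illustrative sizes (`n`, `m`, `flips`) that are hypothetical choices well inside the regime where m is small compared to n.

```python
import random

def hebb_weights(memories, g=1):
    """Outer-product weights w_ij = sum_a u_i^a u_j^a - g*m*delta_ij (assumed form)."""
    m, n = len(memories), len(memories[0])
    w = [[sum(u[i] * u[j] for u in memories) for j in range(n)] for i in range(n)]
    for i in range(n):
        w[i][i] -= g * m  # g = 1 removes the self-connections
    return w

def sgn(x):
    # Ties are broken towards +1; this convention is an assumption of the sketch.
    return 1 if x >= 0 else -1

def step(w, state):
    """One synchronous update of all n neurons with zero margin."""
    return [sgn(sum(wij * sj for wij, sj in zip(row, state))) for row in w]

def demo(n=128, m=3, flips=10, seed=0):
    rng = random.Random(seed)
    memories = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
    w = hebb_weights(memories, g=1)
    # Every memory should be a fixed point of the update.
    fixed = all(step(w, u) == u for u in memories)
    # Corrupt the first memory in `flips` positions and apply one synchronous step.
    probe = list(memories[0])
    for i in rng.sample(range(n), flips):
        probe[i] = -probe[i]
    recalled = step(w, probe) == memories[0]
    return w, fixed, recalled

if __name__ == "__main__":
    _, fixed, recalled = demo()
    print("all memories fixed:", fixed, "| corrupted probe recalled:", recalled)
```

With three memories on 128 neurons the signal term exceeds the typical noise by many standard deviations, so both checks succeed with overwhelming probability; pushing `m` towards `n` makes failures appear.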

The attractiveness of the outer-product algorithm for associative memory has led several investigators including Lee et al. (1986), Maxwell et al. (1986), Psaltis and Park (1986), and Baldi and Venkatesh (1987, 1988) to independently propose higher-order extensions of the algorithm.

2.2.  Outer-Products  of  Higher  Degree 

While the results of McEliece et al. (1987) indicate that for the linear case $d = 1$ the capacity of the outer-product algorithm does not depend on whether self-connections are present or absent, the same does not continue to hold true for higher-order generalizations of the algorithm.

As before, we consider an m-set of memories, $u^1, \ldots, u^m \in \mathbb{B}^n$, whose components are chosen from a sequence of symmetric Bernoulli trials. Consider first a network of n higher-order neurons with dynamics specified by Eq. (3). For every $i \in [n]$ and ordered multiset $I = (j_1, \ldots, j_d)$ of indices from $[n]$, the outer-product algorithm of degree d specifies the interaction coefficients, $w_{(i,I)}$, as a sum of generalized outer-products

$$w_{(i,I)} = \sum_{\nu=1}^{m} u_i^\nu u_I^\nu, \qquad u_I^\nu = \prod_{k=1}^{d} u_{j_k}^\nu. \qquad (5)$$

For the zero-diagonal case we use the same prescription to specify each $w_{(i,I)}$, with $i \in [n]$ and $I$ ranging over the index sets admitted by the dynamics of Eq. (4).

While heuristic arguments suggest that the increase in the available degrees of freedom in the specification of the interaction coefficients would result in a commensurate increase in the fixed point storage capacity (Peretto and Niez, 1986; Baldi and Venkatesh, 1987), hitherto no rigorous estimates of storage capacity have been demonstrated.⁴ We provide a formal analysis in the subsequent sections.
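A practical consequence of the generalized outer-product form is that the $n^d$ coefficients per neuron never need to be stored. The sketch below is an illustration under an assumption: it takes prescription (5) to be $w_{(i,I)} = \sum_\nu u_i^\nu \prod_{j \in I} u_j^\nu$ over all ordered d-tuples $I$, and checks by brute force that the resulting field equals $\sum_\nu u_i^\nu \langle u^\nu, x\rangle^d$. The function names are hypothetical.

```python
from itertools import product
import random

def field_bruteforce(memories, state, i, d):
    """Sum over all ordered d-tuples I of w_(i,I) * prod_{j in I} state_j,
    with w_(i,I) = sum_nu u_i^nu * prod_{j in I} u_j^nu (assumed form of Eq. (5))."""
    n = len(state)
    total = 0
    for I in product(range(n), repeat=d):
        w = 0
        for u in memories:
            p = u[i]
            for j in I:
                p *= u[j]
            w += p
        s = 1
        for j in I:
            s *= state[j]
        total += w * s
    return total

def field_factorized(memories, state, i, d):
    """The same field computed as sum_nu u_i^nu * <u^nu, state>^d."""
    return sum(u[i] * sum(a * b for a, b in zip(u, state)) ** d for u in memories)

if __name__ == "__main__":
    rng = random.Random(1)
    n, m, d = 5, 3, 3
    memories = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
    state = [rng.choice((-1, 1)) for _ in range(n)]
    for i in range(n):
        print(i, field_bruteforce(memories, state, i, d),
              field_factorized(memories, state, i, d))
```

The factorized form costs O(mn) per neuron instead of O(m n^d), which is what makes simulations of higher-degree networks feasible.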

3.  Fixed  Points  and  Static  Capacity 
3.1.  The  Main  Result 

Consider a network of degree d. By the evolution rule (3), if the ith component of the $\alpha$th memory is to be stable, we require that

$$u_i^\alpha \sum_{I} w_{(i,I)}\, u_I^\alpha > \mathcal{B},$$

where $\mathcal{B}$ is the margin of operation. If each of the memories is to be a fixed point of the network we require nm relations of the above form to be simultaneously satisfied, one per memory component.

³ The capacity estimates of McEliece et al. apply to the case where the memories are required to be stable (or, more generally, where they are required to be attractors), which will be our principal consideration in this paper. A somewhat different computational feature of the algorithm has been investigated by Newman (1988) and Komlós and Paturi (1988) who demonstrated that if errors are permitted in recall of the memories then the capacity of the outer-product algorithm can, in fact, increase linearly with n (cf. also the epsilon capacity results of Venkatesh (1986) and Venkatesh and Psaltis (1991) in this regard).

⁴ See Newman (1988), however, for investigations along a slightly different track.

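Now select the coefficients $w_{(i,I)}$ according to prescription (5) for the outer-product algorithm of degree d. For each n define the sequence of doubly indexed random variables $X_n^{i,\alpha}$ with

$$X_n^{i,\alpha} = u_i^\alpha \sum_{I} w_{(i,I)}\, u_I^\alpha = \sum_{\nu=1}^{m} \sum_{I} u_i^\alpha u_i^\nu u_I^\alpha u_I^\nu = n^d + \sum_{\nu \neq \alpha} \Bigl( u_i^\alpha u_i^\nu \sum_{I} u_I^\alpha u_I^\nu \Bigr). \qquad (6)$$

Setting, for $\nu \neq \alpha$,

$$Y_n^\nu = u_i^\alpha u_i^\nu \sum_{I} u_I^\alpha u_I^\nu, \qquad (7)$$

we get

$$X_n^{i,\alpha} = n^d + \sum_{\nu \neq \alpha} Y_n^\nu. \qquad (8)$$

The evolution rule (3) will fail to retrieve the ith component of the $\alpha$th memory, $u_i^\alpha$, if $X_n^{i,\alpha} \le \mathcal{B}$. If we identify the term $n^d$ as the "signal" term and the term $\sum_{\nu \neq \alpha} Y_n^\nu$ as the "noise" term, a memory is $\mathcal{B}$-stable if the signal term less the margin exceeds the noise term for each component.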

Let $\mathcal{E}_n^{i,\alpha}$ denote the event $\{X_n^{i,\alpha} \le \mathcal{B}\}$, and let $\mathcal{E}_n = \bigcup_{i=1}^{n} \bigcup_{\alpha=1}^{m} \mathcal{E}_n^{i,\alpha}$ be the event that one or more memory components is not retrieved (i.e., is not $\mathcal{B}$-stable). We are interested in the probability, $P\{\mathcal{E}_n\}$, of the event $\mathcal{E}_n$: we would like m to be as large as possible while keeping the probability of $\mathcal{E}_n$ small, i.e., m as large as possible while keeping the probability of exact retrieval of each of the memories high. For notational simplicity we henceforth suppress the i, $\alpha$ dependence of the random variables $X_n^{i,\alpha}$ and $Y_n^\nu$ except where there is possibility of confusion. Denote

$$\mu_n = E\{Y_n^\nu\},$$

and for each d let

$$\lambda_d = \frac{(2d)!}{d!\,2^d}. \qquad (9)$$

The following theorem is the main result of this section; it provides an estimate of the storage capacity of the outer-product algorithm of degree d.

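Theorem 3.1. Consider a higher-order neural network of degree d with weights chosen according to the outer-product algorithm of Eq. (5) and with a choice of margin $\mathcal{B} = m\mu_n$ in the evolution rule (3). For any fixed $\epsilon > 0$ and $\varpi > 0$:

1. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi)\,n^d}{2(2d + 1)\lambda_d \log n} \left[ 1 + \frac{2 \log\log n + 2 \log\bigl(2(d + 1)\lambda_d \sqrt{\epsilon}\bigr)}{(2d + 1)\log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right], \qquad (10)$$

then the probability that each of the memories is $m\mu_n$-stable is $\ge 1 - \epsilon$;

2. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi)\,n^d}{2(d + 1)\lambda_d \log n} \left[ 1 + \frac{\log\log n + \log\bigl(2\epsilon(d + 1)\lambda_d\bigr)}{(d + 1)\log n} \right], \qquad (11)$$

then the expected number of memories that are $m\mu_n$-stable is $\ge m(1 - \epsilon)$.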

Remarks. The size of the margin of operation is dictated by the expected size of the noise term for a typical state which is not a memory. As we will see in the subsequent development, the expected value of the noise term can be as large as the order of $m n^{(d-1)/2}$. If this is not compensated for in the margin of operation a large number of extraneous states (nonmemories) will also become fixed points of the system. Note also that relaxing the requirement that all the memories be stable to just requiring that most of the memories be stable effects (roughly) a twofold increase in the number of memories that can be stored.

Corollary 3.2. For a given degree of interaction $d \ge 1$ and margin $m\mu_n$, the sequence

$$\underline{C}_n = \frac{d!\,2^{d-1}}{(2d + 1)!}\,\frac{n^d}{\log n}$$

is a lower capacity for the outer-product algorithm.

Corollary 3.3. For a given degree of interaction $d \ge 1$ and margin $m\mu_n$, the sequence

$$\underline{C}_n^* = \frac{d!\,2^{d-1}}{(2d)!\,(d + 1)}\,\frac{n^d}{\log n}$$

is a weak lower capacity for the outer-product algorithm.

Remark. Slightly sharper bounds are derived in Section 4 for the case $d = 1$.
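To make the constants in the two corollaries concrete, the following sketch evaluates them exactly. It assumes the definition $\lambda_d = (2d)!/(d!\,2^d)$ and reads the corollaries as giving the coefficients $1/(2(2d+1)\lambda_d)$ and $1/(2(d+1)\lambda_d)$ of $n^d/\log n$; the function names are illustrative only.

```python
from fractions import Fraction
from math import factorial

def lam(d):
    """lambda_d = (2d)! / (d! 2^d), i.e. the 2d-th moment of a standard normal."""
    return Fraction(factorial(2 * d), factorial(d) * 2 ** d)

def lower_capacity_coeff(d):
    """Coefficient of n^d / log n in the lower capacity: d! 2^(d-1) / (2d+1)!."""
    return 1 / (2 * (2 * d + 1) * lam(d))

def weak_lower_capacity_coeff(d):
    """Coefficient of n^d / log n in the weak lower capacity: d! 2^(d-1) / ((2d)! (d+1))."""
    return 1 / (2 * (d + 1) * lam(d))

if __name__ == "__main__":
    for d in range(1, 6):
        lo, weak = lower_capacity_coeff(d), weak_lower_capacity_coeff(d)
        print(d, lam(d), lo, weak, weak / lo)
```

The ratio of the weak to the strict coefficient is $(2d+1)/(d+1)$, which equals 3/2 at $d = 1$ and approaches 2 as d grows, in line with the "roughly twofold" gain noted in the remarks.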


3.2.  Outline  of  the  Proof 

The main step involved in the proof of the theorem is the estimation of the probability, $P\{\mathcal{E}_n^{i,\alpha}\} = P\{X_n \le m\mu_n\}$, that one component of a memory is not retrieved. We will utilize techniques from the theory of large deviations of a sum of random variables from its mean to estimate this probability. Over the next two sections we demonstrate that for the range of m we consider, the following estimate holds: for any $\varpi > 0$

$$P\{\mathcal{E}_n^{i,\alpha}\} \le m \exp\left\{ -\frac{(1 - \varpi)\,n^d}{2\lambda_d m} \right\}. \qquad (12)$$

The probability that one or more memory components are not retrieved is less than nm times the probability that one memory component is not retrieved; likewise, the expected fraction of memories that is not $\mathcal{B}$-stable is just the probability that one memory is not $\mathcal{B}$-stable, and this probability is bounded by n times the probability that one memory component is not retrieved. Using the estimate of Eq. (12) together with a choice of m according to Eq. (10) and (11), respectively, yields an upper bound of $\epsilon$ for these probabilities, and concludes the proof.

The two corollaries follow as a consequence of uniformity: the probability that all the memories are $\mathcal{B}$-stable decreases monotonically as the number of memories increases. If, for instance, for any fixed $\delta > 0$ the number of memories is chosen to be equal to $(1 - \delta)$ times the capacity estimate of Corollary 3.2, it is easy to see that for large n the number of memories will be less than that specified by Eq. (10). The resulting probability that all the memories are $\mathcal{B}$-stable will hence be asymptotically better than $1 - \epsilon$. A similar line of reasoning also establishes Corollary 3.3.

The main idea in establishing Eq. (12) is to exploit the fact that the r.v.'s $Y_n^\nu$, $\nu \neq \alpha$, defined in (7) are i.i.d. Referring to (8), the probability that a memory component is not retrieved is just the probability that the sum, $\sum_{\nu \neq \alpha} (Y_n^\nu - \mu_n)$, of $(m - 1)$ zero-mean, i.i.d. r.v.'s is less than or equal to $-n^d + \mu_n$. As we will see in the next two sections, a careful estimation of the mean, $\mu_n$, of the r.v.'s $Y_n^\nu$ will yield that $\mu_n = o(n^d)$. It suffices, hence, to estimate the probability that $\sum_{\nu \neq \alpha} (Y_n^\nu - \mu_n) \le -n^d$, i.e., to estimate the probability that the sum of r.v.'s deviates from the mean by the large deviation $n^d$.

For the case of first-order interactions, $d = 1$, the situation simplifies somewhat. For this case the r.v.'s $(Y_n^\nu - \mu_n)$ themselves turn out to be the sum of $(n - 1)$ i.i.d., symmetric ±1 r.v.'s, and the large deviation estimate for the probability that a memory component is not retrieved can be obtained by an application of the generalized Chebyshev inequality. We present the derivation of the probability estimate for this case in Section 4.

For $d > 1$ additional problems arise as the r.v. $Y_n^\nu$ has an infinite moment generating function. In particular, the Chebyshev estimates of Eqs. (33) and (34) in the Appendix work only trivially. We tackle this case in Section 5. The results needed here are two large deviation lemmas (A.6 and A.7) found in the Appendix.

4.  First-Order  Interactions 

We  begin  with  the  following  elementary  observation. 

Fact 4.1. Let $b_1, \ldots, b_N$ be i.i.d., symmetric, ±1 r.v.'s. Let $a_1, \ldots, a_N$ be any set of ±1 r.v.'s independent of the r.v.'s $b_k$, $k = 1, \ldots, N$. Then the r.v.'s $Z_k = a_k b_k$, $k = 1, \ldots, N$, are i.i.d., symmetric, ±1 r.v.'s.

Remark. Note that the r.v.'s $a_k$ need not be symmetric and may depend on each other.
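Fact 4.1 can be confirmed by exhaustive enumeration for small N. The sketch below is only an illustration: the joint law chosen for the $a_k$ is an arbitrary, deliberately asymmetric and fully dependent example, and `z_distribution` is a hypothetical helper name.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def z_distribution(a_dist, N):
    """Exact law of Z = (a_1 b_1, ..., a_N b_N).

    a_dist maps tuples a in {-1, 1}^N to probabilities (any joint law);
    b is uniform on {-1, 1}^N and independent of a.
    """
    pb = Fraction(1, 2 ** N)
    dist = defaultdict(Fraction)
    for a, pa in a_dist.items():
        for b in product((-1, 1), repeat=N):
            z = tuple(x * y for x, y in zip(a, b))
            dist[z] += pa * pb
    return dict(dist)

if __name__ == "__main__":
    # a is biased and its components are completely dependent on each other.
    a_dist = {(1, 1, -1): Fraction(3, 4), (-1, -1, 1): Fraction(1, 4)}
    for z, p in sorted(z_distribution(a_dist, 3).items()):
        print(z, p)
```

Every one of the $2^N$ sign patterns receives probability $2^{-N}$, which is exactly the statement that the $Z_k$ are i.i.d. and symmetric: for each fixed a, the map from b to the componentwise product is a bijection.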

Lemma 4.2 below is a particular application of Chebyshev's inequality. The result is an asymptotic expression for the probability $P\{\mathcal{E}_n^{i,\alpha}\}$ that a particular memory component is not retrieved. The result agrees with what would be obtained by a naive application of the Central Limit Theorem.

Lemma 4.2. Let the order of interaction be $d = 1$ and let $\mathcal{B} = m$ be the margin of operation. If the number of memories, m, is chosen such that $m = o(n)$ and $m/\sqrt{n} \to \infty$, then

$$P\{\mathcal{E}_n^{i,\alpha}\} \le \exp\left\{ -\frac{n}{2m} \right\} \qquad (n \to \infty). \qquad (13)$$

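Proof. From Eq. (6) we can write

$$X_n = n + m - 1 + \sum_{\nu \neq \alpha} \sum_{j \neq i} Z_j^\nu,$$

where, for fixed i and $\alpha$, we define the random variables $Z_j^\nu = u_i^\alpha u_i^\nu u_j^\alpha u_j^\nu$. Note that by Fact 4.1 the r.v.'s $Z_j^\nu$, $j \neq i$, $\nu \neq \alpha$, are i.i.d., symmetric, ±1 r.v.'s.⁵ By Corollary A.2 we have for a choice of margin $\mathcal{B} = m$ that

$$P\{\mathcal{E}_n^{i,\alpha}\} = P\{X_n \le m\} = P\Bigl\{ \sum_{\nu \neq \alpha} \sum_{j \neq i} Z_j^\nu \le -n + 1 \Bigr\} \le \inf_{r \ge 0} e^{-r(n-1)}\, E\Bigl\{ e^{-r \sum_{\nu \neq \alpha} \sum_{j \neq i} Z_j^\nu} \Bigr\} = \inf_{r \ge 0} e^{-r(n-1)}\, E\Bigl\{ \prod_{\nu \neq \alpha} \prod_{j \neq i} e^{-r Z_j^\nu} \Bigr\}.$$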

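The terms in the product, $e^{-r Z_j^\nu}$, $\nu \neq \alpha$, $j \neq i$, are independent r.v.'s as the r.v.'s $Z_j^\nu$ are independent. The expectation of the product of r.v.'s above can, hence, be replaced by the product of expectations. Accordingly, denoting by Z an r.v. which takes on values −1 and 1 only, each with probability $\tfrac{1}{2}$, we have

$$P\{\mathcal{E}_n^{i,\alpha}\} \le \inf_{r \ge 0} e^{-r(n-1)} \bigl( E\,e^{-rZ} \bigr)^{(m-1)(n-1)} = \inf_{r \ge 0} e^{-r(n-1)} (\cosh r)^{(m-1)(n-1)}.$$

Now, for every $r \in \mathbb{R}$ we have $\cosh r \le e^{r^2/2}$. Hence

$$P\{\mathcal{E}_n^{i,\alpha}\} \le \inf_{r \ge 0} \exp\left( \frac{r^2 (m-1)(n-1)}{2} - r(n-1) \right) = \exp\left( -\frac{n-1}{2(m-1)} \right),$$

the infimum being attained at $r = 1/(m-1)$. Equation (13) can now be readily verified recalling the condition $m/\sqrt{n} \to \infty$. ■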
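The two analytic steps of this proof, the inequality $\cosh r \le e^{r^2/2}$ and the minimization over r, are easy to check numerically. The sketch below works in the log domain to avoid overflow; the values of `n` and `m` are arbitrary illustrative choices and the function names are hypothetical.

```python
import math

def log_chernoff(r, n, m):
    """log of e^{-r(n-1)} * cosh(r)^{(m-1)(n-1)}, the bound before relaxing cosh."""
    k = (m - 1) * (n - 1)
    return -r * (n - 1) + k * math.log(math.cosh(r))

def closed_form_log_bound(n, m):
    """log of the relaxed bound exp(-(n-1) / (2(m-1))), attained at r = 1/(m-1)."""
    return -(n - 1) / (2 * (m - 1))

if __name__ == "__main__":
    n, m = 101, 11
    r_star = 1 / (m - 1)
    print("exact Chernoff exponent at r*:", log_chernoff(r_star, n, m))
    print("closed-form exponent:         ", closed_form_log_bound(n, m))
    # cosh(r) <= exp(r^2 / 2) on a grid of r values
    print(all(math.cosh(0.01 * t) <= math.exp((0.01 * t) ** 2 / 2) for t in range(500)))
```

Because $\cosh r$ is strictly smaller than $e^{r^2/2}$ away from zero, the unrelaxed exponent at $r = 1/(m-1)$ is slightly more negative than the closed form, so the closed form is a valid (marginally loose) bound.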

We  are  now  equipped  to  complete  the  proof  of  the  theorem  for  the  case 
d  =  1.  We  will,  in  fact,  prove  a  slightly  stronger  version  of  the  theorem 
with  constants  for  the  lower  capacity  which  are  larger  than  those  given  in 
Corollaries  3.2  and  3.3. 

Proof of Theorem 3.1 ($d = 1$). From Eq. (7) we have that

$$Y_n^\nu = \sum_{j=1}^{n} u_i^\alpha u_i^\nu u_j^\alpha u_j^\nu = 1 + \sum_{j \neq i} Z_j^\nu.$$

Hence $\mu_n = E(Y_n^\nu) = 1$, so that the requisite margin of operation in the theorem is $\mathcal{B} = m\mu_n = m$. It is easy to verify that a choice of m as in Eq. (10) with $d = 1$ satisfies the conditions of Lemma 4.2. Hence

$$P\{\mathcal{E}_n\} = P\Bigl\{ \bigcup_{i=1}^{n} \bigcup_{\alpha=1}^{m} \mathcal{E}_n^{i,\alpha} \Bigr\} \le \sum_{i=1}^{n} \sum_{\alpha=1}^{m} P\{\mathcal{E}_n^{i,\alpha}\} \le nm \exp\left\{ -\frac{n}{2m} \right\}. \qquad (14)$$

For a choice of

$$m = \frac{n}{4 \log n} \left[ 1 + \frac{\log\log n + \log 4\epsilon}{2 \log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right] \qquad (15)$$

in Eq. (14) we have that $P\{\mathcal{E}_n\} \le \epsilon$ as $n \to \infty$. As the probability that each of the memories is m-stable is exactly $1 - P\{\mathcal{E}_n\}$, this establishes the first part of the theorem (with a slightly better constant for the critical number of memories).

⁵ The critical fact here is that each r.v. $Z_j^\nu$ has a distinct multiplicative term $u_j^\nu$ which occurs solely in the expression for $Z_j^\nu$.

The second part also follows similarly by noting that the probability that a particular memory is not stable is at most $n \exp\{-n/2m\}$ by the union bound, and for a choice of m given by

=  2l^  ^ 

this yields an upper bound of $\epsilon$ for the probability. The result follows as the expected number of memories that are not stable is m times the probability that one memory is not stable. (Again, the estimate for m given by Eq. (16) is slightly sharper than that quoted in the theorem.) ■
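The first part of this argument can be checked numerically. The sketch below assumes that Eq. (15) reads $m = \frac{n}{4\log n}\bigl[1 + \frac{\log\log n + \log 4\epsilon}{2\log n}\bigr]$ with the O-term dropped, and evaluates the union bound $nm\,e^{-n/2m}$ of Eq. (14); both function names are hypothetical.

```python
import math

def m_eq15(n, eps):
    """Number of memories from the assumed form of Eq. (15), O-term omitted."""
    ln = math.log(n)
    return n / (4 * ln) * (1 + (math.log(ln) + math.log(4 * eps)) / (2 * ln))

def union_bound(n, m):
    """n * m * exp(-n / (2m)): bound on the probability that some component fails."""
    return n * m * math.exp(-n / (2 * m))

if __name__ == "__main__":
    eps = 0.01
    for n in (10 ** 4, 10 ** 6, 10 ** 8):
        m = m_eq15(n, eps)
        print(n, round(m), union_bound(n, m))
```

For large n the bound settles close to the target $\epsilon$, which is what the choice of the $\log\log n$ and $\log 4\epsilon$ corrections is designed to achieve; the leading term $n/(4\log n)$ alone would not give a bound that tends to a prescribed constant.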

The uniformity of the binomial distribution helps us to establish the lower capacity of the algorithm.

Corollary 4.3. For a degree of interaction $d = 1$ and a margin of operation $\mathcal{B} = m$, the sequence

$$\underline{C}_n = \frac{n}{4 \log n}$$

is a lower capacity for the outer-product algorithm.

Corollary 4.4. For a degree of interaction $d = 1$ and a margin of operation $\mathcal{B} = m$, the sequence

$$\underline{C}_n^* = \frac{n}{2 \log n}$$

is a weak lower capacity for the outer-product algorithm.

Proof of Corollaries 4.3 and 4.4. We will sketch the proof of Corollary 4.3; the proof of Corollary 4.4 is similar. Let $\tau_M$ explicitly denote the probability $P\{\mathcal{E}_n^{i,\alpha}\}$ that one component of a memory is not stable as a function of the number of memories, M. Fix any choice of $\delta > 0$, and consider a number of memories, $M = (1 - \delta)n/4 \log n$. For any $\epsilon > 0$ chosen arbitrarily small in Theorem 3.1 we can choose n large enough so that $M < m$ with m chosen as in Eq. (15). The result now follows from Lemma 4.2 since the probability that at least one memory component is not retrievable is bounded from above by $nM\tau_M \le nm\tau_m \le nm\,e^{-n/2m} \le \epsilon$. ■

Remarks. Corollary 4.3 provides an improvement of a factor of 3/2 over the lower capacity claimed in Corollary 3.2, while Corollary 4.4 provides an improvement of a factor of 2 over the corresponding weak lower capacity claimed in Corollary 3.3. McEliece et al. (1987) show that $n/4 \log n$ is also an upper capacity for the outer-product algorithm for the linear interaction case $d = 1$, so that $n/4 \log n$ is, in fact, the capacity of the
algorithm.  (The  constants  obtained  there  for  the  o(l)  terms  in  Eq.  (10) 
with  d  =  I  are  slightly  sharper — a  coefficient  of  J  for  the  log  log  n/log  n 
term  instead  of  the  coefficient  i  that  we  obtain  in  Eq.  (15) — but  these  do 
not affect the capacity results.) The proof of the main theorem in McEliece et al. (1987) also yields the estimate $n/2 \log n$ for the weak capacity.


5.  Higher-Order  Interactions 

The above proof of the theorem for $d = 1$ fails, however, when the interaction order d is larger than one: specifically, for $d \ge 3$ and $r > 0$, the r.v. $Y_n^\nu$ has an infinite moment generating function so that $E\,e^{-rY_n^\nu}$ becomes unbounded and the generalized Chebyshev's inequality of Eq. (34) is too weak. (For $r = 0$ the Chebyshev bound is trivial.) To see this, consider $d = 3$, for instance, and $r > 0$; from Eq. (7) we obtain

$$Y_n^\nu = \Bigl( 1 + \sum_{j \neq i} u_i^\alpha u_i^\nu u_j^\alpha u_j^\nu \Bigr)^3.$$

Let $U \sim N(0, 1)$ be a standard normal r.v. By Fact 4.1 the summands $u_i^\alpha u_i^\nu u_j^\alpha u_j^\nu$, $j \neq i$, are i.i.d., ±1, symmetric r.v.'s so that by the Central Limit Theorem $Y_n^\nu$ converges in distribution to $(1 + \sqrt{n - 1}\,U)^3$, and this has an infinite moment generating function. For $d = 2$, Chebyshev's inequality is workable, but the bound is terribly weak. We will hence need the large deviation lemma A.7 to cater to the higher-order cases.

Before proving the theorem for general interaction orders, we first establish some further properties of the random variables $Y_n^\nu$.

Definition 5.1. Let Y be a discrete r.v. taking values in $\{\theta_j\}_{j=-k}^{k}$. We say that:

1. Y is skew-symmetric if $P\{Y = \theta_{-j}\} = P\{Y = \theta_j\}$ for $j = 1, \ldots, k$.

2. Y is unimodal if $P\{Y = \theta_{-j}\} < P\{Y = \theta_{-j+1}\}$ and $P\{Y = \theta_{j-1}\} > P\{Y = \theta_j\}$ for $j = 2, \ldots, k$.

We note that, in fact, the r.v.'s $Y_n^\nu$ are skew-symmetric and unimodal. Set $\xi_j^\nu = u_j^\alpha u_j^\nu$. For fixed $\alpha \neq \nu$, the r.v.'s $\xi_1^\nu, \ldots, \xi_n^\nu$ are i.i.d., symmetric, ±1 r.v.'s.

For d even, $Y_n^\nu = \xi_i^\nu \bigl( \xi_i^\nu + \sum_{j \neq i} \xi_j^\nu \bigr)^d$. The r.v. $Y_n^\nu$ takes values in the set

$$\{ -n^d, -(n - 2)^d, \ldots, (n - 2)^d, n^d \}$$

and for $k = 0, 1, \ldots, \lfloor n/2 \rfloor$

$$P\{Y_n^\nu = -(n - 2k)^d\} = P\{Y_n^\nu = (n - 2k)^d\} = \binom{n}{k} 2^{-n}.$$

Hence the r.v.'s $Y_n^\nu$ are symmetric (consequently, also skew-symmetric) and unimodal.

For d odd, $Y_n^\nu = \bigl( 1 + \sum_{j \neq i} \xi_i^\nu \xi_j^\nu \bigr)^d$. The r.v. $Y_n^\nu$ takes values in the set

$$\{ -(n - 2)^d, -(n - 4)^d, \ldots, (n - 2)^d, n^d \}$$

and for $k = 0, 1, \ldots, \lfloor (n - 1)/2 \rfloor$

$$P\{Y_n^\nu = -(n - 2k - 2)^d\} = P\{Y_n^\nu = (n - 2k)^d\} = \binom{n - 1}{k} 2^{-(n-1)}.$$

Hence the r.v.'s $Y_n^\nu$ are skew-symmetric and unimodal.
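The two probability tables above can be verified by enumerating all sign patterns for a small n. This sketch assumes the representation $Y = \xi_i (\sum_j \xi_j)^d$ with $\xi_1, \ldots, \xi_n$ i.i.d. symmetric ±1 (which reduces to $(1 + \sum_{j \neq i} \xi_i \xi_j)^d$ for odd d); `y_counts` is a hypothetical helper that returns raw counts out of $2^n$ patterns.

```python
from collections import Counter
from itertools import product
from math import comb

def y_counts(n, d, i=0):
    """Count, over all 2^n sign patterns xi, how often Y = xi_i * (sum_j xi_j)^d
    takes each value."""
    counts = Counter()
    for xi in product((-1, 1), repeat=n):
        counts[xi[i] * sum(xi) ** d] += 1
    return counts

if __name__ == "__main__":
    n = 5
    even = y_counts(n, 2)
    for k in range(n // 2 + 1):
        v = (n - 2 * k) ** 2
        print("d=2", v, even[v], even[-v], comb(n, k))
    odd = y_counts(n, 3)
    for k in range(n):
        v = (n - 2 * k) ** 3
        # 2 * C(n-1, k) patterns out of 2^n, i.e. probability C(n-1, k) / 2^(n-1)
        print("d=3", v, odd[v], 2 * comb(n - 1, k))
```

For even d the counts are symmetric about zero, while for odd d the support is shifted (the largest value $n^d$ is paired in probability with $-(n-2)^d$), which is the skew-symmetry used in the text.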

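Lemma 5.2. For each n the r.v.'s $Y_n^\nu$, $\nu \neq \alpha$, are i.i.d. and as $n \to \infty$ satisfy

$$E(Y_n^\nu) \begin{cases} = 0 & \text{if } d \text{ is even} \\ \sim d\,\lambda_{(d-1)/2}\, n^{(d-1)/2} & \text{if } d \text{ is odd}, \ d = o(n); \end{cases} \qquad (17)$$

$$\mathrm{Var}(Y_n^\nu) \sim \lambda_d\, n^d \quad \text{if } d = o(n). \qquad (18)$$

Remarks. We actually show a little more than is claimed here. In Eq. (20) we show an exact expression for $\mu_n$. This is needed to set the margin of operation accurately.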

Proof. Recall that we had defined $\mu_n = E\{Y_n^\nu\}$, and $\lambda_t = (2t)!/(t!\,2^t)$ for every nonnegative integer t. As before, denoting $\xi_k^\nu = u_k^\alpha u_k^\nu$ for $k = 1, \ldots, n$, and $\nu \neq \alpha$ we can write

$$Y_n^\nu = \xi_i^\nu \Bigl( \sum_{k=1}^{n} \xi_k^\nu \Bigr)^d.$$

The r.v.'s $\xi_k^\beta$, $k = 1, \ldots, n$, $\beta \neq \alpha$, are mutually independent by Fact 4.1. Furthermore, each r.v. $Y_n^\nu$ is determined by the distinct set of r.v.'s $\xi_1^\nu, \ldots, \xi_n^\nu$ which appear in no other $Y_n^\beta$, $\beta \neq \nu$. Consequently, the r.v.'s $Y_n^\nu$ ($\nu \neq \alpha$) are i.i.d. for each n.

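When d is even, following Definition 5.1 we have established that $Y_n^\nu$ is symmetric so that $E(Y_n^\nu) = 0$. Let us now consider d odd. From Eq. (7) and by reason of the independent choices of the memories $u^\alpha$ and $u^\nu$ we have

$$E(Y_n^\nu) = \sum_{j_1, \ldots, j_d} E\bigl( u_i^\alpha u_{j_1}^\alpha \cdots u_{j_d}^\alpha \bigr)\, E\bigl( u_i^\nu u_{j_1}^\nu \cdots u_{j_d}^\nu \bigr). \qquad (19)$$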

We now use the elementary fact that if $x \in \mathbb{B}$ then

$$x^k = \begin{cases} x & \text{if } k \text{ is odd} \\ 1 & \text{if } k \text{ is even,} \end{cases}$$

together with the independence of the components $u_j^\alpha$. Each expectation in the sum in Eq. (19) is over a product of an even number, $d + 1$, of ±1 r.v.'s corresponding to the fixed index i and to each assignment of values to $j_1, \ldots, j_d$. The expectation will have value 1 iff an odd number of indices $j_k$ take the value i, and for every index value $h \neq i$ an even number (possibly zero) of indices $j_k$ take the value h; otherwise the expectation has value 0.
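The parity rule just stated turns the mean into a pure counting problem, which can be brute-forced for small n and d. The sketch below is an illustration, not part of the original argument: it counts the index assignments $(j_1, \ldots, j_d)$ for which every index, including the fixed index i, occurs an even number of times in the multiset $(i, j_1, \ldots, j_d)$. The function name is hypothetical.

```python
from collections import Counter
from itertools import product

def count_contributing_tuples(n, d, i=0):
    """Number of (j_1, ..., j_d) in [n]^d whose term in Eq. (19) equals 1.

    A term contributes iff i occurs an odd number of times among the j_k and
    every other value occurs an even number of times; equivalently, every
    value occurs an even number of times once the outer factor u_i is included.
    """
    count = 0
    for js in product(range(n), repeat=d):
        mult = Counter(js)
        mult[i] += 1  # the factor u_i standing outside the product
        if all(c % 2 == 0 for c in mult.values()):
            count += 1
    return count

if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, count_contributing_tuples(n, 3), 3 * n - 2)
```

For $d = 3$ the count is $3n - 2$ (one tuple with all three indices equal to i, plus $3(n-1)$ tuples with one index equal to i and a repeated pair), which matches the leading behaviour $d\,\lambda_{(d-1)/2}\,n^{(d-1)/2} = 3n$; for even d the count is zero because $d + 1$ factors cannot all pair off.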

Let $N_q$ be the number of ways $j_1, \ldots, j_d$ can be chosen from $[n]$ such that precisely q of the $j_k$ are equal to i with each distinct value assigned to the remaining $d - q$ indices occurring an even number of times. We hence have

$$E(Y_n^\nu) = \sum_{q \text{ odd}} N_q = \sum_{r=1}^{(d+1)/2} N_{2r-1}.$$

Now $2r - 1$ indices from $j_1, \ldots, j_d$ can be chosen equal to i in $\binom{d}{2r-1}$ ways. We must enumerate the number of ways, $N'_{2r-1}$, that values $j \neq i$ can be assigned to the remaining $d - 2r + 1$ indices $j_k$ such that each index occurs an even number of times.

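For $k = 1, \ldots, (d - 2r + 1)/2$ let $\mathbf{s} = (s_1, \ldots, s_k)$ be a vector such that $1 \le s_k \le s_{k-1} \le \cdots \le s_1 \le (d - 2r + 1)/2$ and $\sum_{l=1}^{k} s_l = (d - 2r + 1)/2$. Let $S_1, \ldots, S_K$ partition $\{s_1, \ldots, s_k\}$ in such a way that each $S_l$ is a maximal collection of $s_j$'s that are equal, and let $\gamma_l = |S_l|$. Define the redundancy factor

$$f(\mathbf{s}) = \prod_{l=1}^{K} \gamma_l!.$$

We claim that

$$N'_{2r-1} = \sum_{k=1}^{(d-2r+1)/2} \sum_{\mathbf{s}} \frac{(d - 2r + 1)!}{(2s_1)! \cdots (2s_k)!\, f(\mathbf{s})}\, (n - 1) \cdots (n - k).$$

In fact, the inner sum over $\mathbf{s}$ enumerates the number of ways k distinct values $j \neq i$ can be assigned to the $d - 2r + 1$ indices $j_k$ with each index occurring an even number, $2s_l$, of times. The redundancy factor, $f(\mathbf{s})$, is required to compensate for overcounting when some of the $s_l$'s are equal. (For instance, $f(\mathbf{s}) = k!$ if $s_1 = \cdots = s_k$, while $f(\mathbf{s}) = 1$ if each $s_l$ is distinct.) Thus (with the convention that $\sum_{k=a}^{b} = 1$ if $b < a$), we have

$$\mu_n = E(Y_n^\nu) = \sum_{r=1}^{(d+1)/2} N_{2r-1} = \sum_{r=1}^{(d+1)/2} \binom{d}{2r-1} N'_{2r-1}$$

$$= \sum_{r=1}^{(d+1)/2} \binom{d}{2r-1} \sum_{k=1}^{(d-2r+1)/2} \sum_{\mathbf{s}} \frac{(d - 2r + 1)!}{(2s_1)! \cdots (2s_k)!\, f(\mathbf{s})}\, (n - 1) \cdots (n - k) \qquad (20)$$

$$= \frac{d!}{[(d - 1)/2]!\; 2^{(d-1)/2}}\, n^{(d-1)/2} + O\bigl( n^{(d-3)/2} \bigr), \quad \text{if } d = o(n).{}^{6} \qquad (21)$$

Now $(Y_n^\nu)^2 = \bigl( \sum_{j=1}^{n} u_j^\alpha u_j^\nu \bigr)^{2d}$. A similar argument to that above gives

$$E\{(Y_n^\nu)^2\} = \lambda_d\, n^d + O(n^{d-1}), \quad \text{if } d = o(n).{}^{7} \qquad (22)$$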

Equations (21) and (22) together complete the proof of the lemma. ■

⁶ We can verify this by a standard CLT argument. Let $U \sim N(0, 1)$ be a standard Gaussian r.v. For d odd, as we saw from the earlier representation, $Y_n^\nu$ converges to $(1 + \sqrt{n - 1}\,U)^d$ in distribution by the CLT. Using $E\,U^k = 0$ if k is odd and $E\,U^k = k!/[(k/2)!\,2^{k/2}]$ if k is even, the leading term in the binomial expansion of $E(1 + \sqrt{n - 1}\,U)^d$ yields the result. We do not directly use this argument, however, as the exact representation of the mean is

Remarks. The previous result establishes the need for a margin of $\mathcal{B} = m\mu_n$ in the evolution rule (3). For d even, of course, the margin is precisely zero as the r.v.'s $Y_n^\nu$ are symmetric and have zero mean. For d odd, however, the mean of the noise term in Eq. (6) will be of the order of $m n^{(d-1)/2}$. If $m \gg n^{(d+1)/2}$ then this dominates the signal term, $n^d$, in Eq. (6). Hence, almost all states (not just the memories) are fixed points under an evolution rule with zero margin. Removing the bias due to this mean results in the evolution rule of Eq. (3) with a choice of margin $m\mu_n$. Clearly, we can expect the memories to be $m\mu_n$-stable because there is still a strong bias of the order of $n^d$ due to the signal term; most randomly chosen states, however, will not be $m\mu_n$-stable. The usage of a suitable margin hence ensures performance as a viable associative memory.

Note that for $d = 1$, however, we can dispense with the margin of m as for $m = o(n)$ the signal term n dominates the mean noise term m. Hence, for the linear case we could adopt any choice of margin $0 \le \mathcal{B} \le m$, and obtain adequate performance with the same capacity (McEliece et al., 1987).
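The size of the bias that the margin must absorb can be computed exactly for moderate n and compared with the asymptotic forms of Lemma 5.2. The sketch below assumes the binomial representations of $Y_n^\nu$ for odd d, namely $Y = (1 + S)^d$ with S a sum of $n - 1$ symmetric ±1 terms, and $Y^2 = (\text{sum of } n \text{ symmetric ±1 terms})^{2d}$; the function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def lam(t):
    """lambda_t = (2t)! / (t! 2^t)."""
    return Fraction(factorial(2 * t), factorial(t) * 2 ** t)

def exact_mean(n, d):
    """E[Y] for odd d: Y = (n - 2k)^d with probability C(n-1, k) / 2^(n-1)."""
    return sum(Fraction(comb(n - 1, k), 2 ** (n - 1)) * (n - 2 * k) ** d
               for k in range(n))

def exact_second_moment(n, d):
    """E[Y^2] = E[(sum of n symmetric +-1 terms)^(2d)]."""
    return sum(Fraction(comb(n, k), 2 ** n) * (n - 2 * k) ** (2 * d)
               for k in range(n + 1))

if __name__ == "__main__":
    d = 3
    for n in (10, 100, 400):
        mu = exact_mean(n, d)
        asym = d * lam((d - 1) // 2) * n ** ((d - 1) // 2)
        second = exact_second_moment(n, d)
        print(n, mu, float(mu / asym), float(second / (lam(d) * n ** d)))
```

For $d = 3$ the exact mean is $3n - 2$, so the asymptotic value $3n$ is off by a lower-order term; this is the kind of correction that the text insists on keeping when setting the margin $m\mu_n$, since it is multiplied by m.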

The following main lemma uses the large deviation result of Lemma A.7 to estimate the probability that a single component of any given memory is not $\mathcal{B}$-stable.

Lemma  5.3.  For  any  interaction  order  d  >  1,  margin  35  =  and 
any  choice  of  parameter  D>  d  if  we  choose  m  such  that  m/i'"*®'  —»  » 

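and $m = O(n^d/\log n)$, then for every $\varpi > 0$

$$P\{\mathcal{E}_n^{i,\alpha}\} \le m \exp\left\{ -\frac{(1 - \varpi)\,n^d}{2\lambda_d m} \right\}, \qquad n \to \infty. \qquad (23)$$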

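Proof. Lemma 4.2 gives the result for $d = 1$. We hence consider the case $d > 1$. Define the normalized sequence of r.v.'s $T_n^\nu$ by

$$T_n^\nu = \frac{1}{\sqrt{\lambda_d\, n^d}}\, (Y_n^\nu - \mu_n). \qquad (24)$$

By Lemma 5.2, $E(T_n^\nu) = 0$ and $\mathrm{Var}(T_n^\nu) \to 1$ as $n \to \infty$. Set $M = m - 1$ for notational simplicity. Clearly $M \to \infty$ and $M = o(n^d)$. Using Lemma 5.2 with Eqs. (8) and (24) we have

$$P\{\mathcal{E}_n^{i,\alpha}\} = P\{X_n \le m\mu_n\} = P\Bigl\{ \sum_{\nu \neq \alpha} T_n^\nu \le -\frac{n^d - \mu_n}{\sqrt{\lambda_d\, n^d}} \Bigr\}. \qquad (25)$$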

important in determining the probability that a row-sum violation occurs. If we use only the highest-order term for the mean, the succeeding terms that were ignored will dominate the inequality as $n \to \infty$.

⁷ Again, $(Y_n^\nu)^2$ converges to $(\sqrt{n}\,U)^{2d}$ in distribution, and $E(\sqrt{n}\,U)^{2d} = (2d)!\,n^d/(d!\,2^d)$.

By construction, and by Lemma 5.2, the r.v.'s $T_n^\nu$ satisfy conditions 1 and 2 of Lemma A.7. Comparing Eqs. (25) and (38), we hence must show that conditions 3 and 4 are also met for the choice of parameter $D > d \ge 2$ in order to complete the proof. We show the result when the interaction order is odd, so that $D > d \ge 3$. The proof is similar when d is even.

With  a  notation  similar  to  that  earlier,  we  have 


|r,|  =  |(i  +  2  f.-e;)'  - 

By  Lemma  5.2  we  have  that  fi„  =  (?(/?**'“  "'^).  Further,  it  is  easy  to  see  that 
|1  +  (Jn-xY  ^  1  +  Using  the  simple  inequality  (A  +  BY  ^ 

2'(A'  +  B')  valid  for  positive  A,  B,  and  x,  it  hence  follows  from  Lemma 
A. 6  that 


lim  sup  E{exp(j:|7^P®)} 

ff-** 

s  lim  sup  Elexp{x22<‘'"'»'OXj'®|U„-,|2‘"'’n-‘''o  +  0(n-'  ^)}] 

If-** 

<  00 

whenever  we  choose  x  such  that 


•  Note  that  y,  =  otW®’*'®'").  Hence,  the  bound  Eq.  (38)  in  Lemma  A. 7  will  hold  trivially 
for any positive choice of y < Jt once condition 3 is established.

This establishes Eq. (36). Now, noting that $T_n^\nu$ is a discrete r.v. which takes on only a finite set of real values with nonzero probability, we have
for  any  choice  of  A’,  =  fi{(log  that 


2  nn  =  /} 

'fl>/4nog 


^  ^  t'  +  M.)""  -  1}. 

UI>i4(log  fl)^ 


In  the  above  /4  >  0  is  a  real  constant,  and  the  summation  is  over  the  finite 
set  of  real  values  that  Vn  can  assume  in  the  range  |t|  >  >4 (log  Now, 
from  Eq.  (26)  we  have 


17^1  s  +  0(/j-''2)  <  2x;''^/i‘" 

as  |1  -t-  s  n.  Further,  U„-i  is  a  symmetric  (binomially  distributed), 
unimodal  r.v.  Hence,  we  can  find  B  =  A''^  ±  0{«"''^(log  n)'®'^}  such  that 

ft  max  =  (tXi'^«‘«  +  Mn)''"  -  1}. 

l(|>.4{lot 


It  follows  that 


1 dF\{t)  s  4X;'n<'  =  fixr«''^(log 


Vn 

Vtt  Xrf 


exp 


B^Xi'‘'(log  nf"‘\ 


by an application of Corollary A.5. But by the choice of D we have that
D/d > 1, so that the right-hand side of Eq. (27) is $o(n^{-D})$, and this
concludes the proof. ■

Proof of Theorem 3.1 (d > 1). An application of Lemma 5.3 together
with the union bound finishes the proof of the theorem. For any fixed
$\varpi > 0$

p{««}  =  p  { U  U 

i"!  0=1 

,  /  (1  -  m)n'‘\ 
It can now be readily verified by substitution of Eq. (10) that $P\{\mathcal{E}_n\} \leq \epsilon$.
Part 2 can be verified similarly. ■

A uniformity argument similar to the one used for Corollary 4.3 completes
the proof of Corollaries 3.2 and 3.3 when d > 1. It appears plausible
that, just as in the linear case d = 1, the rates of growth in Corollaries 3.2
and 3.3 also apply to upper capacities for higher interaction orders d > 1.
The dependencies in the random variables, however, become rather more
severe when d > 1, and, as yet, there are no rigorous proofs in this regard.
In particular, the proof techniques used by McEliece et al. (1987) in
establishing capacities for d = 1 cannot be used in toto for the higher-order
case.


6.  Zero-Diagonal  Networks 

As before, let $u^1, \ldots, u^m \in \mathbb{B}^n$ be an m-set of memories, whose
components are chosen from a sequence of symmetric Bernoulli trials.
We now consider zero-diagonal networks with interconnection weights
chosen according to prescription (5) for the zero-diagonal outer-product
algorithm of degree d.

Analogously with the notation of the previous section, for each n define
the sequence of doubly indexed random variables $X_n^{i,\alpha}$ with


ATi.®  =  uf  'Z  =  ufZ  Z  «.■ 


^U)u1 


(28) 


Again  suppressing  the  i,  a  dependence  and  setting 

n  =  Z 

!«/ 


(29) 


we  get 


+  Z  y^n. 


For a margin of operation zero, the evolution will fail to retrieve the ith
component of the $\alpha$th memory, $u_i^\alpha$, if $X_n^{i,\alpha} \leq 0$. As before, let $\mathcal{E}_n^{i,\alpha}$ denote
the event $\{X_n^{i,\alpha} \leq 0\}$, and let $\mathcal{E}_n = \bigcup_{i=1}^{n} \bigcup_{\alpha=1}^{m} \mathcal{E}_n^{i,\alpha}$ be the event that one or
more memory components is not stable.

Clearly $E(Y_n) = 0$ and it is also easy to verify that $\mathrm{Var}(Y_n) = \binom{n}{d}$. The
following result then follows analogously to Theorem 3.1 with virtually
the same proof. (The situation is, in fact, simpler in the zero-diagonal case
as the symmetric nature of the r.v.'s $Y_n$ ensures that Lemma A.7 readily
applies in this instance.)

Theorem 6.1. Consider a zero-diagonal higher-order neural network
of degree d with weights chosen according to the outer-product algorithm
of Eq. (5) and with a choice of margin $\mathcal{B} = 0$ in the evolution rule (4). For
any fixed $\epsilon > 0$ and $\varpi > 0$:

1. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi) n^d}{2(2d + 1)(d)! \log n} \left[ 1 + \frac{2 \log\log n + 2 \log 2(2d + 1)(d)!\sqrt{\epsilon}}{(2d + 1) \log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right],$$

then the probability that each of the memories is a fixed point is $\geq 1 - \epsilon$;

2. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi) n^d}{2(d + 1)! \log n} \left[ 1 + \frac{\log\log n + \log 2\epsilon (d + 1)!}{(d + 1) \log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right],$$

then the expected number of memories that are fixed points is $\geq m(1 - \epsilon)$.

Corollary 6.2. For a given degree of interaction $d \geq 1$ and margin
$\mathcal{B} = 0$ the sequence

$$\frac{n^d}{2(2d + 1)(d)! \log n}$$

is a lower capacity for the zero-diagonal outer-product algorithm.

Corollary 6.3. For a given degree of interaction $d \geq 1$ and margin
$\mathcal{B} = 0$ the sequence

$$\frac{n^d}{2(d + 1)! \log n}$$

is a weak lower capacity for the zero-diagonal outer-product algorithm.

Remarks. Again, for d = 1 we can sharpen the results somewhat using
the same techniques as in Section 4. The result is a capacity and weak
capacity exactly given by Corollaries 4.3 and 4.4, respectively; i.e., for
first-order interactions the presence or absence of the diagonal terms
makes no difference to the capacity. This, as seen above, is not true for
d > 1, however.

Note the somewhat surprising result that the zero-diagonal capacities
are larger than their nonzero diagonal counterparts even though the signal
term in the zero-diagonal case is somewhat lower than for the nonzero
diagonal case. In fact, the ratio of the zero-diagonal capacity to the capacity
when the diagonal terms are not set to zero is the rather substantial
factor of $\lambda_d/(d)!$. For large interaction orders, therefore, the outer-product
algorithm with diagonal terms set to zero picks up a factor of $2^d/\sqrt{\pi d}$
in capacity. This effect can be traced to the additional noise variance
caused by the diagonal terms when they are present (Eq. (18)); the growth
in the noise due to the nonzero diagonal terms exceeds the corresponding
growth in the signal term. In particular, adding the diagonal terms causes
an increase in the signal term from $\binom{n}{d}$ to $n^d$; however, the corresponding
growth in the noise variance is somewhat larger, from $(m - 1)\binom{n}{d}$ to
$(m - 1)\lambda_d n^d$.
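The size of this capacity gain can be tabulated directly. The sketch below is illustrative only: it assumes that the constant $\lambda_d$ of Eq. (9), which is not restated in this section, equals $(2d)!/((d)!\,2^d)$ (the 2d-th moment of a standard normal variable), and the function names are hypothetical.

```python
from math import factorial, pi, sqrt


def lambda_d(d: int) -> float:
    """Assumed value of lambda_d: (2d)! / (d! 2^d)."""
    return factorial(2 * d) / (factorial(d) * 2 ** d)


def zero_diagonal_gain(d: int) -> float:
    """Ratio lambda_d / d! of zero-diagonal to nonzero-diagonal capacity."""
    return lambda_d(d) / factorial(d)


def large_d_estimate(d: int) -> float:
    """Stirling-type estimate 2^d / sqrt(pi d) of the same ratio."""
    return 2 ** d / sqrt(pi * d)


if __name__ == "__main__":
    for d in (1, 2, 5, 10, 20):
        print(d, zero_diagonal_gain(d), large_d_estimate(d))
```

Under this assumption the ratio is exactly 1 for d = 1, in agreement with the remark that the diagonal terms make no difference for first-order interactions, and it approaches the estimate $2^d/\sqrt{\pi d}$ from below as d grows.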


7. Attractors and Dynamic Capacity

The capacity results derived above are readily extendable when the
memories are required not just to be stable, but to be attractors. Let
$u^1, \ldots, u^m \in \mathbb{B}^n$ be an m-set of randomly chosen memories and consider an
outer-product network of degree d. Fix $0 \leq \rho < 1/2$, and let $u[\alpha]$ be a
randomly chosen state within the Hamming ball of radius $\rho n$ surrounding
an arbitrarily chosen memory $u^\alpha$. We will require that system dynamics
map $u[\alpha]$ into the memory $u^\alpha$ with high probability.

As before, we define the sequence of doubly indexed random variables
$X_n^{i,\alpha}$ by


X 


1.0 

n 


uf  S 


wf  S  S  «rMrM([ai. 


Setting 
we get


A'i"  =  +  '2 


Note that by the sphere hardening effect the random state $u[\alpha]$ will lie
on the surface of the Hamming ball of radius $\rho n$ surrounding the memory
$u^\alpha$ with high probability for large n. We hence have that the estimate
$S_n^{i,\alpha} \sim n^d (1 - 2\rho)^d$ for the signal term holds with probability approaching
one as $n \to \infty$. The signal term is reduced from its maximum value of $n^d$
because of the slight initial mismatch (essentially $\rho n$ components) between
the probe vector $u[\alpha]$ and the memory $u^\alpha$. Now, for d even the
noise terms $Y_n^{\alpha,\beta}$ are symmetric r.v.'s. For d odd we can write

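$$Y_n^{\alpha,\beta} = \Bigl( u_i^\alpha u_i[\alpha] + \sum_{j \neq i} u_i^\alpha u_i^\beta u_j^\beta u_j[\alpha] \Bigr)^d = \Bigl( \chi^\alpha + \sum_{j \neq i} \xi_j \Bigr)^d,$$

where the r.v. $\chi^\alpha = u_i^\alpha u_i[\alpha]$ has mean approaching $1 - 2\rho$ for large n,
and is independent of the symmetric, i.i.d., ±1 r.v.'s $\xi_j = u_i^\alpha u_i^\beta u_j^\beta u_j[\alpha]$
for $j \neq i$.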

The evolution rule (3) will fail to retrieve the ith component of the $\alpha$th
memory, $u_i^\alpha$, if $X_n^{i,\alpha} \leq \mathcal{B}$. As before, let $\mathcal{E}_n^{i,\alpha}$ denote the event $\{X_n^{i,\alpha} \leq \mathcal{B}\}$,
and let $\mathcal{E}_n = \bigcup_{i=1}^{n} \bigcup_{\alpha=1}^{m} \mathcal{E}_n^{i,\alpha}$ be the event that one or more memory
components are not retrieved (i.e., are not $\mathcal{B}$-stable). We are interested in
the probability, $1 - P\{\mathcal{E}_n\}$, that each of the fundamental memories attracts
a randomly chosen state in the Hamming ball of radius $\rho n$ surrounding
each memory in one synchronous step, as well as in the allied weak
sense result.

Let $\lambda_d$ be as defined in Eq. (9), and let $\mu_n = E\{Y_n^{\alpha,\beta}\}$. We see that the
arguments used in the proof of Theorem 3.1 continue to work here, albeit
with a slight reduction in the value of the signal term.
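The reduced signal term and one-step recall from a noisy probe can be illustrated with a small simulation. This is a minimal sketch rather than the procedure analysed in the text: it assumes that the nonzero-diagonal field at neuron i factors as $\sum_\beta u_i^\beta \langle u^\beta, u \rangle^d$, it uses an even degree (d = 2) with a zero margin, and every function name is hypothetical.

```python
import random


def overlap(a, b):
    """Inner product of two +/-1 vectors."""
    return sum(x * y for x, y in zip(a, b))


def one_step(memories, state, d):
    """One synchronous update with zero margin.

    Assumption of this sketch: the field at neuron i is
    sum over memories of mem[i] * <mem, state>**d.
    """
    powers = [overlap(mem, state) ** d for mem in memories]
    return [
        1 if sum(mem[i] * p for mem, p in zip(memories, powers)) >= 0 else -1
        for i in range(len(state))
    ]


def noisy_probe(memory, rho, rng):
    """Flip exactly round(rho * n) randomly chosen components."""
    probe = list(memory)
    for i in rng.sample(range(len(memory)), round(rho * len(memory))):
        probe[i] = -probe[i]
    return probe


def demo(n=100, m=4, d=2, rho=0.1, seed=0):
    """Return the signal term and whether one step recovers the memory."""
    rng = random.Random(seed)
    memories = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
    probe = noisy_probe(memories[0], rho, rng)
    signal = overlap(memories[0], probe) ** d
    recovered = one_step(memories, probe, d) == memories[0]
    return signal, recovered
```

With n = 100, d = 2 and 10 flipped components the overlap with the memory is 80, so the signal term is $80^2 = 6400 = n^d (1 - 2\rho)^d$. The number of memories is deliberately kept far below the capacity estimates, so the contribution of the other memories is small compared with the signal.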

Theorem 7.1. Fix $\epsilon > 0$, $\varpi > 0$, and choose a margin $\mathcal{B} = m\mu_n$ in the
evolution rule (3) for the outer-product algorithm of degree d. For any
fixed radius of attraction, $\rho > 0$:

1. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi)(1 - 2\rho)^{2d} n^d}{2(2d + 1)\lambda_d \log n} \left[ 1 + \frac{2 \log\log n + 2 \log 2(2d + 1)\lambda_d\sqrt{\epsilon}}{(2d + 1) \log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right],$$

then the probability that for each fundamental memory a randomly chosen
state in the Hamming ball of radius $\rho n$ surrounding the memory is
mapped into the memory in one synchronous step is $\geq 1 - \epsilon$;

2. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi)(1 - 2\rho)^{2d} n^d}{2(d + 1)\lambda_d \log n} \left[ 1 + \frac{\log\log n + \log 2\epsilon (d + 1)\lambda_d}{(d + 1) \log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right],$$

then the expected number of memories which attract a randomly chosen
state in the Hamming ball of radius $\rho n$ surrounding the memory in one
synchronous step is $\geq m(1 - \epsilon)$.

Corollary 7.2. For a given degree of interaction $d \geq 1$ and a fixed
choice of $0 \leq \rho < 1/2$ the sequence

$$\left( \frac{(d)! \, (1 - 2\rho)^{2d} \, 2^{d-1}}{(2d + 1)!} \right) \frac{n^d}{\log n}$$

is a lower $\rho$-attractor capacity in one-step synchronous operation for the
outer-product algorithm of degree d.

Corollary 7.3. For a given degree of interaction $d \geq 1$ and a fixed
choice of $0 \leq \rho < 1/2$ the sequence

$$\left( \frac{(d)! \, (1 - 2\rho)^{2d} \, 2^{d-1}}{(2d)! \, (d + 1)} \right) \frac{n^d}{\log n}$$

is a weak lower $\rho$-attractor capacity in one-step synchronous operation
for the outer-product algorithm of degree d.
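As a worked check of the constants in these two corollaries, the following sketch evaluates them with exact rational arithmetic. It assumes, as an illustration only, that $\lambda_d = (2d)!/((d)!\,2^d)$ (Eq. (9) is not restated here), that the attraction radius enters through the factor $(1 - 2\rho)^{2d}$, and that logarithms are natural; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial, log


def lambda_d(d):
    """Assumed value of lambda_d from Eq. (9): (2d)! / (d! 2^d)."""
    return Fraction(factorial(2 * d), factorial(d) * 2 ** d)


def attractor_constant(d, rho, weak=False):
    """Constant multiplying n^d / log n: lower capacity, or weak lower capacity."""
    k = d + 1 if weak else 2 * d + 1
    return (1 - 2 * Fraction(rho)) ** (2 * d) / (2 * k * lambda_d(d))


def attractor_capacity(n, d, rho, weak=False):
    """Evaluate the capacity sequence at a finite n (natural logarithm assumed)."""
    return float(attractor_constant(d, rho, weak)) * n ** d / log(n)
```

For example, `attractor_constant(2, Fraction(1, 10))` gives $(4/5)^4/30$, so for d = 2 a fractional attraction radius of 0.1 reduces the fixed-point constant 1/30 by a factor of about 0.41.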

The fixed point capacity results of Corollaries 3.2 and 3.3 are hence
weakened by just the multiplicative factor $(1 - 2\rho)^{2d}$ if we require, in
addition, that there be attraction over a Hamming ball of radius $\rho n$ in one
synchronous step. Analogous results hold for the zero-diagonal case.
Specifically:

Theorem 7.4. Fix $\epsilon > 0$, $\varpi > 0$, and choose a margin of zero in the
evolution rule (4) for the zero-diagonal outer-product algorithm of degree
d.

1. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi)(1 - 2\rho)^{2d} n^d}{2(2d + 1)(d)! \log n} \left[ 1 + \frac{2 \log\log n + 2 \log 2(2d + 1)(d)!\sqrt{\epsilon}}{(2d + 1) \log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right],$$

then the probability that for each fundamental memory a randomly chosen
state in the Hamming ball of radius $\rho n$ surrounding the memory is
mapped into the memory in one synchronous step is $\geq 1 - \epsilon$;

2. If, as $n \to \infty$, we choose m such that

$$m = \frac{(1 - \varpi)(1 - 2\rho)^{2d} n^d}{2(d + 1)! \log n} \left[ 1 + \frac{\log\log n + \log 2\epsilon (d + 1)!}{(d + 1) \log n} - O\!\left( \frac{\log\log n}{(\log n)^2} \right) \right],$$

then the expected number of memories which attract a randomly chosen
state in the Hamming ball of radius $\rho n$ surrounding the memory in one
synchronous step is $\geq m(1 - \epsilon)$.

Corollary 7.5. For a given degree of interaction $d \geq 1$ and a fixed
choice of $0 \leq \rho < 1/2$ the sequence

$$\left( \frac{(1 - 2\rho)^{2d}}{2(2d + 1)(d)!} \right) \frac{n^d}{\log n}$$

is a lower $\rho$-attractor capacity in one-step synchronous operation for the
zero-diagonal outer-product algorithm of degree d.

Corollary 7.6. For a given degree of interaction $d \geq 1$ and a fixed
choice of $0 \leq \rho < 1/2$ the sequence

$$\left( \frac{(1 - 2\rho)^{2d}}{2(d + 1)!} \right) \frac{n^d}{\log n}$$

is a weak lower $\rho$-attractor capacity in one-step synchronous operation
for the zero-diagonal outer-product algorithm of degree d.

The following nonrigorous argument (as in McEliece et al. (1987))
seems to indicate that if we allow nondirect convergence to the memories
then we can, in fact, remove the factors of $(1 - 2\rho)^{2d}$ by which the
capacity is reduced if we insist on direct convergence. Consider the nonzero
diagonal situation again, for instance. Fix a small $\rho^* > 0$. If the
number of fundamental memories is chosen to be

$$m = \frac{(1 - \varpi)(1 - 2\rho^*)^{2d} n^d}{2(2d + 1)\lambda_d \log n}$$

then by Theorem 7.1 each fundamental memory directly attracts over a
Hamming sphere of radius $\rho^* n$. Let $\rho < 1/2$ be the desired (fractional) radius
of attraction. Extending Lemma 5.3 for the direct convergence case (i.e.,
replacing n in Eq. (23) by $n_\rho = (1 - 2\rho)n$) we obtain that the asymptotic
probability, $\tau$, that a single component of a given memory is incorrectly
labeled is bounded by


log  n 


)■ 


It is easily seen that $\tau \to 0$ as $n \to \infty$ if the desired fractional radius of
attraction, $\rho$, satisfies


In the multiple step synchronous case the probe vector has essentially $\rho n$
components incorrectly specified. The first synchronous state transition
will map the probe vector to a state where essentially $n\tau$ components are
wrong, with high probability. For any fixed $\rho^*$, however small, we can
choose n large enough so that the probability of component misclassification,
$\tau$, becomes smaller still. Thus, for large enough n, the probe vector
will be mapped within the confines of a Hamming sphere of (small) radius
$\rho^* n$ surrounding the memory. But by Theorem 7.1 the next state transition
will converge directly to the fundamental memory with very high probability.
This (nonrigorous) argument indicates that for every fixed (small)
$\rho^*$, and every choice of attraction radius $\rho$ satisfying Eq. (32), we can find
n large enough that any randomly chosen state in the Hamming ball of
radius $\rho n$ surrounding the memories will converge to the corresponding
fundamental memories within two synchronous transitions. Now, keeping
$\rho$ fixed, if we allow $\rho^*$ to approach zero it appears that the factor
$(1 - 2\rho)^{2d}$ can be dropped from the capacity expression for large enough n.*

8.  Conclusions 

We have established that the outer-product algorithm of degree d (and
its zero-diagonal variant) can store at least of the order of $n^d/\log n$ memories.
Open questions include the determination of tight upper capacities,
rates of convergence, and capacities when more than one synchronous
step is allowed in the dynamics, and extending and tightening Newman's
(1988) description of the energy landscape to obtain estimates of the number
of memories that can be stored when a certain error-tolerance is
permitted in recall. The key issue here is whether, as in the case d = 1, we
can gain a factor of log n in capacity if errors are allowed in the retrieval of
the memories.

* The difficulty in making this rigorous is that we must estimate the probability of the
conjunction of two successive events: one mapping a ball of radius $\rho n$ into a smaller ball of
radius $\rho^* n$, and the other mapping the ball of radius $\rho^* n$ into the memory.


Appendix  A:  Large  Deviations 

The technical lemmas of this section principally deal with large deviations
of a sum of random variables from its mean. Lemma A.1 is a generalization
of the Chebyshev inequality. Lemma A.3 is a standard approximation
of the tail of the normal distribution function. Lemma A.4 is the
classical large deviation Central Limit Theorem for sums of (0,1) random
variables. Lemma A.6 outlines an inequality for generating functions in
the spirit of Khintchine's inequality. Finally, Lemma A.7 is a large deviation
result which applies to deviations much larger than those handled by
the Central Limit Theorem. The lemma is motivated by a large deviation
result due to Newman for symmetric random variables. Lemmas A.1 to
A.4 are standard results and we quote them without proof (cf. Feller
(1968), for instance).

Lemma A.1 (Generalized Chebyshev Inequality). Let $\psi_+$ be a monotonically
increasing positive function on the real line. Let Y be any random
variable and suppose that $E(\psi_+(Y))$ exists. Then for any u

$$P\{Y \geq u\} \leq \frac{E(\psi_+(Y))}{\psi_+(u)}.$$

Similarly, if $\psi_-$ is any monotonically decreasing positive function with
$E(\psi_-(Y)) < \infty$, then

$$P\{Y \leq u\} \leq \frac{E(\psi_-(Y))}{\psi_-(u)}.$$

Corollary A.2. For any random variable Y and any u > 0

$$P\{Y \geq u\} \leq \inf_{r \geq 0} e^{-ru} E(e^{rY}), \tag{33}$$

$$P\{Y \leq -u\} \leq \inf_{r \geq 0} e^{-ru} E(e^{-rY}). \tag{34}$$
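To see how the bound of Eq. (33) behaves in the simplest case relevant here, the sketch below applies it to a sum $R_n$ of n independent ±1 variables, for which $E(e^{rR_n}) = (\cosh r)^n$, and compares it with the exact binomial tail. The function names are illustrative and not part of the original text.

```python
from math import atanh, comb, cosh, exp


def exact_tail(n, u):
    """P{R_n >= u} for R_n a sum of n independent fair +/-1 variables."""
    favourable = sum(comb(n, s) for s in range(n + 1) if 2 * s - n >= u)
    return favourable / 2 ** n


def chernoff_bound(n, u):
    """inf over r >= 0 of exp(-r u) * cosh(r)**n, valid for 0 < u < n."""
    r = atanh(u / n)  # stationary point of -r*u + n*log(cosh(r))
    return exp(-r * u) * cosh(r) ** n


if __name__ == "__main__":
    for u in (10, 20, 40):
        print(u, exact_tail(100, u), chernoff_bound(100, u), exp(-u * u / 200))
```

Since $\cosh r \leq e^{r^2/2}$, the optimised bound is never larger than the Gaussian-type estimate $e^{-u^2/2n}$, which is the form usually quoted.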
As usual, in the following we denote by $\varphi$ the normal density function

$$\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

and by $\Phi$ the normal distribution function

$$\Phi(x) = \int_{-\infty}^{x} \varphi(y) \, dy.$$

Lemma A.3. $1 - \Phi(x) \sim \varphi(x)/x$ as $x \to \infty$.
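The standard normal tail approximation, $1 - \Phi(x) \sim \varphi(x)/x$, is easy to check numerically. The following sketch uses the complementary error function from the Python standard library; the function names are illustrative.

```python
from math import erfc, exp, pi, sqrt


def normal_density(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)


def normal_upper_tail(x):
    """1 - Phi(x), computed through the complementary error function."""
    return 0.5 * erfc(x / sqrt(2))


def tail_ratio(x):
    """Ratio of the exact tail to the approximation density(x) / x."""
    return normal_upper_tail(x) / (normal_density(x) / x)


if __name__ == "__main__":
    for x in (2, 5, 10):
        print(x, tail_ratio(x))
```

The ratio increases towards 1 as x grows and stays between $1 - 1/x^2$ and 1, which is the classical two-sided refinement of the approximation.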

Lemma  A. 4.  Let  {5^}  be  a  sequence  of  i.i.d.  random  variables  taking 
on  values  0  and  I ,  each  with  probability  i.  For  each  n  let  S„  =  i  ^j,  and 
let  Ok  =  P{5„  =  \nll  \  +  k},  and  put 


h 


2 


If  n  —*  *  and  k  is  constrained  to  an  interval  k<  where  /T,  =  o(n  then 

there  are  constants  A  and  B  such  that 


'hip(hk) 


(35) 


uniformly  in  Ar;  and,  in  fact,  hip(hk)  is  an  asymptotic  upperbound  for  an 
for  any  k.  Further, 


UoJ 


P{5,  a  [n/2]  +  K„}=  2  a* 

t-x. 


Corollary  A. 5.  Let  R„  denote  the  sum  of  n  i.i.d.  random  variables 
taking  on  values  - 1  and  I  only,  each  with  probability  Let  dt  =  P{R„  = 
k}.  If,  as  n—*  k  is  constrained  to  an  interval  k<  K„  where  K„  =  o{/i^) 
then 

f  =  0  if  n  -  k  is  odd 

yTn 

Lemma  A. 6.  Let  {^y}  be  a  sequence  of  i.i.d.  random  variables  taking 
on  values  -1  and  1,  each  with  probability  L  Let  U„  =  Then  for 

any  choice  of  positive  parameters  w  s  2  and  t  <  we  have 


)  if  n- 


is  even. 


n 


lim  sup 


)  <  ®. 
Remark. Note that the function is of the form $\exp\{a|U|^{\omega}\}$ so that
Khintchine's inequality, which requires that the test function be real analytic
with all its derivatives being positive at the origin, cannot be readily
applied.

Proof.  The  basic  strategy  is  to  show  that  the  sequence  of  partial  sums 
corresponding  to  the  Taylor  series  expansion  for  the  generating  function 
defined  above  converges  uniformly.  Accordingly,  we  first  estimate 
E(|f;,h-^)forz>0.  Set^y  =  (^;+  l)/2and  let  S„  -  5,.  Now  f/Jsa 

symmetric  random  variable  and  S„  =  {U„  +  n)/2.  We  have 

=  ln-^'2  =  k} 

=  2/1-^  2  k^  =  (.k  +  n)l2} 

*20 


/»0 


where  o/  =  P{5,  =  rn/2l  +  /}.  Choosing  i  <  t  <  §  we  effect  a  partition  of 
the  above  sum  into  three  partial  sums: 


l(ki»ii»/2j-l  i»’al  IviJ 

E(|t;,h-^)<  2  +  E  +  S 

1-0  Mno»  «vlI  i-Wah  i 

2i  Xj  2) 

Now 


2, 


< 


s  2/i'^flog  n)S 


and  using  the  results  of  Lemmas  A. 3  and  A.4  we  have 


Vw 


Further,  in  the  range  (log  n)/2  ^  I  ^  n'l2  we  have  from  Lemma  A.4  that 
(/  +  1)'  ai


1^2 

s  2'“^ 


I  ”  (^)( 

[/. 


1  +  O 

HI*  l/2V\'^ 
Hi-tnvVi  ■* 


where  we  have  overestimated  the  (1  +  o(l))  term  by  2.  It  hence  follows 
that 


2}  s  4 


[2(1*  l/JvVii 


X-  <fi(x)  dx 


(t-i’Hir  *  I) 

~  4  I  (pix)  dx 


~  4  f  x^  ^(x)  dx 

JO 

=  2^*'ir-''^  r((z  +  I)/2).'» 

As  the  upper  bounds  for  both  2|  and  approach  zero  with  n  it  follows 
that 


E{\U„Yn-^)  £  2‘'^*'ir-''2  r((z  +  l)/2). 

Using  Stirling’s  formula'*  we  then  obtain  for  large  k  and  fixed  <u  >  0  that 
& 

For  large  k,  the  Ath  term  of  the  partial  sum 

Qs='Z^ 

*-o  *• 

hence  decreases  exponentially  provided  w  s  2  and  /  <  w  As  the 
sequence  of  partial  sums  Qs  converges  to  uniformly  in  N,  it 

follows  that  <00.  ■ 

Lemma A.7. Let D > 2 be some fixed parameter, and for each n let
$\{T_n^j\}$ be a sequence of independent random variables (with distribution
function $F_n$) satisfying:

1. $E(T_n^j) = 0$;

2. $\lim_{n \to \infty} \mathrm{Var}(T_n^j) = 1$;

3. There is a number x > 0 such that

$$\limsup_{n \to \infty} E\{\exp(x |T_n^j|^{2/D})\} < \infty; \tag{36}$$

4. For any $K_n = \Omega[(\log n)^{D/2}]$

$$\int_{|t| > K_n} t \, dF_n(t) = o(n^{-D}), \quad (n \to \infty). \tag{37}$$

10 The gamma function is defined for any y > 0 by $\Gamma(y) = \int_0^\infty x^{y-1} e^{-x} \, dx$.

11 For fixed $\omega > 0$ and k large, $k! \sim \sqrt{2\pi k} \, k^k e^{-k}$ and $\Gamma(\omega k) \sim \sqrt{2\pi} \, e^{-\omega k} (\omega k)^{\omega k - 1/2}$.

J 

Let  Mn  be  a  polynomially  increasing  sequence  of  integers  satisfying  = 
o(«®),  and  let  y„  be  a  sequence  satisfying  y„  =  CliVM„  log  M„),  y„  = 
o(Mn),  and  such  that  for  some  positive  y  <  x 

y,  s  {2>d-2vd  yj^f^y>a(D-i)  pgj 


Then  for  any  w  >  0 


ns,.}.M.cxp(-t4^). 


(n  -*  *). 


Remarks.  The  above  lemma  is  a  generalization  of  a  large  deviation 
result  for  symmetric  random  variables  due  to  Newman  (1988).  Note  that 
condition 4 imposes a sort of “asymptotic symmetry” on the random
variables $T_n$. In the application of the lemma to higher-order networks we
will  choose  a  parameter  D  slightly  larger  than  the  degree  of  interaction  d. 

The  deviations,  y^,  encountered  in  the  lemma  can  be  chosen  to  be  as 
large  as  which  are  much  larger  than  the  VW„,  deviations  of 

the  Central  Limit  Theorem. 

The  proof  follows  a  standard  truncation  argument  (cf.  Newman,  1988). 
We  will  in  fact  show  results  slightly  stronger  than  claimed,  viz.. 


p{|  n  =  r.)  =  o  (m.  exp  (-  5^)) 

for  the  range  of  M„  we  will  be  interested  in.  This  estimate  can  be  further 
tightened  by  strengthening  some  of  the  cruder  bounds  in  the  proof. 

Proof.  Define  the  truncated  random  variables 


1^0  otherwise. 
By a straightforward argument it follows that


r, 


fi 


Choosing  r  =  y  in  Eq.  (33)  and  invoking  condition  3  (recall  that  y  <  jr)  we 
get  for  any  choice  of  tn  >  0 


P,  s  M„  E{exp(y|r;:|^0)} 


M 
^exp 


(- 


(1  - 

2M,  )' 


(n 


00). 


(39) 


(The  choice  of  constant  |  is  solely  for  algebraic  convenience  and  does  not . 
affect  the  capacity  results.)  Similarly,  choosing  r  =  yJMn  in  Eq.  (34)  we 
get 

P,  =  [e  exp  (- 

-  'xp{-  ^  log  ECf-'.''.".))}.  MO) 


Claim.  E  =  1  +  y\P.M\  +  o{yl/Ml). 

Proof.  Setting  K„  =  iyl/2yM„y^  we  have 


/  dF^it) 


with  the  latter  equality  following  because  7;  has  zero  mean.  Using  the 
lower  bound  on  y„  and  the  fact  that  Af,  is  polynomially  increasing,  we 
have  fi„  =  n{(log  so  that  by  condition  4  and  the  bounds  on  y„  we 
have 


|E(f:)(  =  o(n-^^)  =  o(A/;’^)  =  o(^).  (41) 

Further,  condition  4  also  ensures  that 

E{(/J)^}  =  i  dF:(t)  -  1,  (n  00).  (42) 
Define the function g by

$$g(u) = e^{-u} - 1 + u - u^2/2.$$

To prove the claim it suffices now to show that

lim  sup  ^  (^^)|}  =  D  <y^.  (43) 

In  fact,  if  Eq.  (43)  holds  them  for  any  5  >  0  we  can  choose  r(8)  such  that 
D/r(8)  <  8/2.  With  such  a  choice  of  r(8)  we  can  now  choose  n  large 
enough  that 


Ml 

sup  ^  g 

|/)s/(S)  Tn 


\Af./  2' 


Hence,  if  Eq.  (43)  holds,  then  for  every  fixed  8  >  0  we  can  choose  n  large 
enough  so  that 

If 

"  f  L„...  I*  (s;)l 

*  sup  ^  I*  (^)| 


Ml^l  _i  y„r„\  ,  .  , 


whenever  Eq.  (43)  holds,  and  by  Eqs.  (41)  and  (42)  this  would  establish 
the  claim.  As  g(u)  ^  cu^e'"  for  some  finite  c  and  all  «,  it  suffices  hence  to 
show  that 


lii^up  E  jlnP  exp  (-  < 
Now, by the truncation of $T_n$ and the bounds on $y_n$ it follows that

V  f*  1 


It  hence  follows  that 

•in^up  E  [|7^P  exp  (-  ^  lim  sup  E{|7''|V'^:i'"l. 

As  y  >  0,  the  exponential  dominates  the  third  power  when  T!,  assumes 
large  values.  Using  the  fact  that  y  <  x  we  can  now  invoke  condition  3  to 
establish  Eq.  (44).  This  establishes  the  claim. 

As  yJMn  -►  0,  we  have  from  Eq.  (40)  that 

Then,  for  every  w  >  0 

txp  (-  (45) 

Equations  (39)  and  (45)  complete  the  proof.  ■ 
Abstract

We consider recurrent neural networks of polynomial threshold units. We study the expected
number of fixed points in the case of random, symmetric interactions, equivalent to higher order
spin  glass  models  of  statistical  physics.  We  derive  precise  asymptotic  estimates  for  the  expected 
number  of  fixed  points  as  a  function  of  the  margin  of  stability.  In  particular,  we  show  that  there 
is  a  critical  range  of  margins  of  stability  (depending  on  the  degree  of  interaction)  such  that  the 
expected  number  of  fixed  points  with  margins  below  the  critical  range  grows  exponentially  with 
the  number  of  nodes  in  the  network,  while  the  expected  number  of  fixed  points  with  margins 
above  the  critical  range  decreases  exponentially  with  the  number  of  nodes  in  the  network.  We 
also  briefly  examine  the  random  energy  model. 


1  INTRODUCTION 

Recurrent networks of formal neurons have been popular in a variety of computational applications.
The model neurons in such structures are typically linear threshold elements which compute the
sign of a linear form of the inputs. A recurrent network results when such elements are fully
interconnected, and as in any recurrent system, the fixed points are important in the characterisation
of the computations done by the structure. A particular case of interest results when the
interconnections between neurons are symmetric: in such cases network dynamics are regulated by a
Hamiltonian or energy function (cf. Hopfield [1] for instance). In such an instance, we can imagine
the state space of the network to be embedded in an energy landscape with fixed points residing
at energy minima. A classical application of such networks is in associative memory where neural
interactions are adjusted so that memories are stored as local attractors.

We consider here a natural extension of the model to recurrent networks comprised of higher
order neurons which compute the sign of a polynomial form of the inputs. The added degrees of
freedom in specifying the polynomial interaction coefficients can be expected to enrich the
computational dynamics that result. Distinct features emerge, however, in the analysis of these structures
depending on whether the higher order interactions are programmed (or “learnt”) or random.

In  the  programmed  scenario,  the  goal  is  to  tailor  the  higher  order  interaction  coefficients  so  as 
to  obtain  desired  dynamical  behaviours;  this  leads  naturally  to  questions  of  capacity  and  efficiency 
of  higher  order  networks  of  given  degree  of  polynomial  interaction.  In  two  concurrent  papers  [2,  3] 
we present rigorous results on algorithmic capacity and efficiency in programmed situations for
higher  order  networks  (cf.  also  Newman  [4]).  The  main  results  can  be  summarised  briefly  as  follows: 
the  computational  gains  in  higher  order  networks  parallel  the  extra  degrees  of  freedom  in  specifying 
the  polynomial  interaction  coefficients;  in  particular,  regardless  of  the  algorithm  used  to  specify  the 
interaction  coefficients,  the  information  storage  capability  of  a  higher  order  network  is  of  the  order 
of  one  bit  per  interaction  coefficient. 

Higher order systems where the polynomial interactions are random may be useful as models
of disordered systems in statistical physics (spin glasses), or of neural networks, before any learning
has occurred, or in the limiting case when too much learning has occurred (the onset of senility!).
These will be our focus of analysis in this paper: in particular, we consider recurrent, higher order
neural networks with symmetric, random polynomial interactions. We characterise the fixed points
of these structures according to their margin of stability,¹ which is a measure of how stable a fixed
point is with respect to perturbations. Our main result may be informally stated as follows:

There exists a critical range of margins of stability (depending on the degree of polynomial
interaction) such that the expected number of fixed points with margins of stability
below the critical range increases exponentially in the size of the network while the expected
number of fixed points with margins of stability above the critical range decays
exponentially as the size of the network is increased.

There  is  thus  a  threshold  phenomenon  in  evidence  for  the  expected  number  of  fixed  points  around 
the  critical  range  of  the  margin  of  stability.  The  fact  that  for  a  certain  range  of  margins  the 
expected  number  of  fixed  points  grows  exponentially  with  the  number  of  nodes  in  the  network  is 
not  unexpected;  more  counter-intuitive,  perhaps,  is  the  existence  of  a  critical  margin  of  stability 
above  which  the  expected  number  of  fixed  points  actually  decays  as  more  nodes  are  added.  We  also 
provide  exact  asymptotic  expressions  for  the  coefficients  and  exponents  in  the  regime  of  exponential 
behaviour,  and  evaluate  the  critical  margins  of  stability.  While  considerable  attention  has  been 
focused on spin glass models in the statistical physics literature, at the time of writing rigorous
results  appear  to  have  been  confined  to  the  case  of  linear  interactions  and  to  estimates  of  the 
expectation  of  the  total  number  of  fixed  points  (cf.  Edwards  and  Tanaka  [5],  Gross  and  Mezard  [6], 
and  McEliece  and  Posner  [7]).  The  estimates  derived  here  provide  a  finer  partition  of  the  expected 
number  of  fixed  points  grouped  according  to  their  margins  of  stability,  and  extend  the  results  to 
higher  order  cases  with  polynomial  interactions  where  the  statistical  dependences  grow  more  acute. 

The basic analytical tool used is Laplace’s method for integrals. The assumed random,
independent, and symmetric nature of the interactions makes for some simplicity in analysis. The
results derived here for the disordered case may also give some intuition in programmed situations
where the interaction dependences are weak, though a corresponding analysis of the number of fixed
points for the programmed case is typically more complicated. The analysis for the programmed
case depends strongly on the algorithm of choice, and is made harder by the presence of statistical
dependences in the interaction coefficients, especially in higher order cases. Rigorous estimates for
the total number of fixed points are known only for the case of linear interactions programmed with
the outer product algorithm: McEliece, et al. [8] conjectured, based on the corresponding situation
with random interactions, that the number of extraneous fixed points is exponential in the number
of nodes, and this was rigorously shown by Komlós and Paturi [9]. The issue remains open for
other algorithms such as the spectral algorithm (cf. Venkatesh and Psaltis [10]) even for the linear
interaction case. For the higher order cases no formal results have been shown for any algorithm.

¹ In this context the notion is due to Komlós and Paturi [9].

In Section 2, we formally introduce recurrent higher order networks, and make precise the
notion of the margin of stability of a fixed point. We also show that when the polynomial interactions
are symmetric, network dynamics are regulated by an energy function. In Section 3 we consider
random, homogeneous, higher order networks and prove the main Theorem 3.3; Table 1 contains a
listing of critical margins of stability for certain fixed degrees of interaction; Corollaries 3.5 and 3.7
highlight two principal applications of the main theorem in showing an explicit expression for the
expected number of fixed points in the exponential regime for the cases where the degree of
interaction is fixed, and increases unboundedly with the number of nodes in the network, respectively; and
Table 2 lists coefficient and exponent values for the exponential regime for certain fixed degrees of
interaction. In Section 4 we deal with non-homogeneous networks, and also briefly examine a
different model of randomness known in the literature as the random energy model (cf. Derrida [11]).
The proofs of the main theorems are developed in the body of the paper, while technical lemmas
and calculations are confined to the two appendices.

A brief word on notation: in addition to standard asymptotic conventions, we write $x_n \lesssim y_n$
if $x_n \le y_n$ for $n$ large enough; all logarithms in the exposition are to base $e$; we also denote by $\mathbb{B}$
the set $\{-1, 1\}$.

2  POLYNOMIAL  THRESHOLD  NETWORKS 

We consider systems of $n$ densely interacting threshold units each of which yields an instantaneous
state $-1$ or $+1$. (This corresponds in the literature to a system of $n$ Ising spins, or alternatively, a
system of $n$ neural states.) The state space is hence the set of vertices of the hypercube. We will
in this discussion also restrict our attention throughout to symmetric interaction systems.

Let $\mathcal{I}_d$ be the family of all subsets of cardinality $d + 1$ of the set $\{1, 2, \ldots, n\}$. Clearly
$|\mathcal{I}_d| = \binom{n}{d+1}$. For any subset $I$ of $\{1, 2, \ldots, n\}$, and for every state
$u = (u_1, u_2, \ldots, u_n) \in \mathbb{B}^n$, set

$$u_I = \prod_{i \in I} u_i.$$

Definition 2.1 A homogeneous polynomial threshold network of degree $d$ is a network of $n$ threshold
elements with interactions specified by a set of $\binom{n}{d+1}$ real coefficients $w_I$ indexed by $I$ in
$\mathcal{I}_d$. The evolution rule is asynchronous, and for $i \in \{1, \ldots, n\}$ is given by

$$u_i^+ = \mathrm{sgn}\Bigl(\sum_{I \in \mathcal{I}_d :\, i \in I} w_I u_{I \setminus \{i\}}\Bigr). \tag{1}$$

A (non-homogeneous) polynomial threshold network of degree $d$ is a network of $n$ threshold elements
with interactions specified by a set of $\sum_{j=1}^{d} \binom{n}{j+1}$ coefficients $w_I$ indexed by $I$ in
$\bigcup_{j=1}^{d} \mathcal{I}_j$, and, for $i \in \{1, \ldots, n\}$, the asynchronous evolution rule

$$u_i^+ = \mathrm{sgn}\Bigl(\sum_{j=1}^{d} \sum_{I \in \mathcal{I}_j :\, i \in I} w_I u_{I \setminus \{i\}}\Bigr). \tag{2}$$


These networks are readily seen to be natural generalisations to higher order of the familiar case
of linear threshold networks ($d = 1$). These systems can be identified either with higher order spin
glasses at zero temperature, or higher order neural networks. Starting from an arbitrary
configuration or state, the system evolves asynchronously by a sequence of single “spin” flips involving
spins which are misaligned with the instantaneous “field.” The dynamics of these symmetric higher
order systems are regulated by higher order extensions of the classical quadratic Hamiltonian. We
define the homogeneous Hamiltonian of degree $d$ by

$$H_d(u) = -\sum_{I \in \mathcal{I}_d} w_I u_I. \tag{3}$$


In like fashion we define the (non-homogeneous) Hamiltonian of degree $d$ by

$$\bar{H}_d(u) = -\sum_{j=1}^{d} \sum_{I \in \mathcal{I}_j} w_I u_I. \tag{4}$$

We briefly sketch the proof of the following result.

Proposition 2.2 The functions $H_d$ and $\bar{H}_d$ are non-increasing under the evolution rules (1) and
(2), respectively.


Proof: We consider the case of $H_d$, the non-homogeneous case being similar. Assume $u \mapsto u^+$ is
a mapping along some arbitrary trajectory in state space for a homogeneous polynomial threshold
network of degree $d$. The proposition is trivially true if $u$ is a fixed point. Consider the case where
$u$ and $u^+$ are distinct. By the nature of the asynchronous mapping $u$ and $u^+$ differ only in a single
component. Without loss of generality assume the $i$-th component of $u$ changes sign: $u_i^+ = -u_i$,
and $u_j^+ = u_j$ if $j \ne i$. Now consider $\delta H_d(u) = H_d(u^+) - H_d(u)$. Factoring out $u_i$ in equation (3)
we can write

$$H_d(u) = -u_i \sum_{I \in \mathcal{I}_d :\, i \in I} w_I u_{I \setminus \{i\}} - \sum_{I \in \mathcal{I}_d :\, i \notin I} w_I u_I.$$

Hence

$$\delta H_d(u) = 2 u_i \sum_{I \in \mathcal{I}_d :\, i \in I} w_I u_{I \setminus \{i\}}.$$

By assumption we have $u_i = -u_i^+ = -\mathrm{sgn}\bigl(\sum_{I \in \mathcal{I}_d :\, i \in I} w_I u_{I \setminus \{i\}}\bigr)$, so that $\delta H_d(u) \le 0$. ∎
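The monotonicity argument above can be checked numerically on a small instance. The following Python sketch is illustrative only: the function names (`make_weights`, `energy`, `local_field`, `run_async`), the 0-based indexing, the random update order, and the convention sgn(0) = +1 are assumptions made here and are not taken from the paper. It draws i.i.d. Gaussian weights, applies the asynchronous rule (1), and records the energy (3) after every flip.

```python
import itertools
import random
from math import prod


def make_weights(n, d, rng):
    """One i.i.d. N(0,1) weight w_I per subset I of {0,...,n-1} with |I| = d + 1."""
    return {I: rng.gauss(0.0, 1.0)
            for I in itertools.combinations(range(n), d + 1)}


def energy(w, u):
    """Homogeneous Hamiltonian (3): H_d(u) = -sum_I w_I * prod_{i in I} u_i."""
    return -sum(wI * prod(u[i] for i in I) for I, wI in w.items())


def local_field(w, u, i):
    """Sum over I containing i of w_I * prod_{j in I, j != i} u_j."""
    return sum(wI * prod(u[j] for j in I if j != i)
               for I, wI in w.items() if i in I)


def run_async(w, u, rng, max_sweeps=1000):
    """Apply rule (1) one unit at a time until no unit changes.

    Returns the final state and the energies recorded after each flip.
    """
    u = list(u)
    n = len(u)
    energies = [energy(w, u)]
    for _ in range(max_sweeps):
        changed = False
        for i in rng.sample(range(n), n):          # random update order (assumption)
            new = 1 if local_field(w, u, i) >= 0 else -1   # sgn(0) = +1 (assumption)
            if new != u[i]:
                u[i] = new
                energies.append(energy(w, u))
                changed = True
        if not changed:
            return u, energies
    raise RuntimeError("no fixed point reached within max_sweeps")


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 6, 2
    w = make_weights(n, d, rng)
    start = [rng.choice((-1, 1)) for _ in range(n)]
    final, energies = run_async(w, start, rng)
    print("flips:", len(energies) - 1, "final state:", final)
```

Each flip changes the energy by $2 u_i$ times the local field, which is non-positive for a misaligned unit, so the recorded energies should form a non-increasing sequence and the loop should stop at a state satisfying (1) with $u_i^+ = u_i$.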


In  the  terminology  of  spin  glasses,  the  state  trajectories  of  these  higher  order  networks  can  be 
seen  to  be  following  essentially  a  zero-temperature  Monte  Carlo  (or  Glauber)  dynamics.  Because 
of  the  monotonicity  of  the  Hamiltonians  given  by  equations  (3)  and  (4)  under  the  asynchronous 
evolution  rule  (1)  [resp.  (2)],  the  system  always  reaches  a  stable  state  (fixed  point)  where  the 
relation (1) [resp. (2)] is satisfied with $u_i^+ = u_i$ for each of the $n$ spins or neural states.

Definition 2.3 Let $B$ be a non-negative parameter (possibly depending on $n$). A fixed point,
$u \in \mathbb{B}^n$, of a homogeneous polynomial threshold network of degree $d$ is $B$-stable if it satisfies

$$u_i \sum_{I \in \mathcal{I}_d :\, i \in I} w_I u_{I \setminus \{i\}} \ge B, \qquad i = 1, \ldots, n. \tag{5}$$

In like fashion, a $B$-stable state, $u \in \mathbb{B}^n$, of a non-homogeneous polynomial threshold network
of degree $d$ satisfies

$$u_i \sum_{j=1}^{d} \sum_{I \in \mathcal{I}_j :\, i \in I} w_I u_{I \setminus \{i\}} \ge B, \qquad i = 1, \ldots, n. \tag{6}$$


For $B$-stable states, $B$ represents the margin of stability for the fixed point; we hence refer to $B$
as the margin. We would expect $B$-stable states with large margins to tend to exhibit correspondingly
large basins of attraction, i.e., to be stable with respect to relatively large perturbations. Note
that according to this definition, all fixed points are 0-stable states. Komlós and Paturi [9] utilise
a similar notion in an analysis of the extraneous stable states for the case of a linear interaction
network ($d = 1$) programmed using the outer product algorithm.
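To make the definition concrete, the margin condition can be evaluated by exhaustive search over all $2^n$ states of a small homogeneous network. The sketch below is a toy illustration under assumptions of our own: the helper names (`margins`, `count_stable`, `mean_count`), the 0-based indexing, and the use of Python's `random.gauss` for the weights are not from the paper. It counts the $B$-stable states for one draw of the weights and averages the count over repeated draws.

```python
import itertools
import random
from math import prod


def margins(w, u):
    """Left-hand sides of condition (5), one per unit i."""
    n = len(u)
    return [u[i] * sum(wI * prod(u[j] for j in I if j != i)
                       for I, wI in w.items() if i in I)
            for i in range(n)]


def count_stable(w, n, B=0.0):
    """Number of states u in {-1, +1}^n whose margins are all at least B."""
    return sum(1 for u in itertools.product((-1, 1), repeat=n)
               if min(margins(w, u)) >= B)


def mean_count(n, d, B, trials, seed=0):
    """Monte Carlo average of count_stable over i.i.d. N(0,1) weights."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        w = {I: rng.gauss(0.0, 1.0)
             for I in itertools.combinations(range(n), d + 1)}
        total += count_stable(w, n, B)
    return total / trials


if __name__ == "__main__":
    print("n=3, d=1, B=0  :", mean_count(3, 1, 0.0, trials=500))
    print("n=6, d=2, B=0  :", mean_count(6, 2, 0.0, trials=200))
    print("n=6, d=2, B=0.5:", mean_count(6, 2, 0.5, trials=200))
```

Exhaustive enumeration is only practical for very small $n$; it is meant as a check on the definitions rather than as a way to study the asymptotic regime treated in Section 3.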


3  HOMOGENEOUS  NETWORKS 

3.1  Higher  Order  Spin  Glasses 

Consider homogeneous polynomial threshold networks whose weights $w_I$, $I \in \mathcal{I}_d$, are i.i.d. $N(0, 1)$
random variables. This is a natural generalisation to higher order of Ising spin glasses with Gaussian
interactions. We will derive an asymptotic estimate for the expected number of $B$-stable states of
the structure. Asymptotic results for the number of 0-stable states (fixed points) for the usual case
$d = 1$ of linear threshold networks with Gaussian interactions have been reported in the literature
(cf. [5, 6, 7]). We extend the technique used by McEliece and Posner [7] to obtain the general
result.

As a function of $n$, let $d_n$ explicitly represent the degree of the homogeneous threshold network,
with the constraint $d_n < n - 1$. To avoid trivialities we restrict ourselves to $n \ge 3$. For any given
$n$, and margin $B \ge 0$, let $F(n, d_n, B)$ denote the expected number of $B$-stable states of a homogeneous
network of degree $d_n$. In the following we will estimate $F(n, d_n, B)$ under various assumptions on $d_n$
and $B$.

Let $P(n, d_n, B)$ denote the probability that a given state $u$ is $B$-stable. Clearly, $F(n, d_n, B) =
2^n P(n, d_n, B)$. Without loss of generality we assume that $u = (1, 1, \ldots, 1)$. For each $n$ and $i =
1, \ldots, n$, define the sequence of random variables $S_n^i$ by

$$S_n^i = \sum_{I \in \mathcal{I}_{d_n} :\, i \in I} w_I.$$

For $u$ to be $B$-stable, we require that $S_n^i \ge B$ for $i = 1, \ldots, n$.

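Now, for each $n$, the random variables $S_n^i$, $i = 1, \ldots, n$, are zero-mean, identically distributed,
and jointly normal. Set

$$p_n = \binom{n-2}{d_n}^{1/2}, \qquad q_n = \binom{n-2}{d_n - 1}^{1/2}. \tag{7}$$

Then we have

$$\mathbf{E}\bigl(S_n^i S_n^j\bigr) = \begin{cases} p_n^2 + q_n^2 = \binom{n-1}{d_n} & \text{if } i = j, \\ q_n^2 = \binom{n-2}{d_n - 1} & \text{if } i \ne j. \end{cases}$$

Now, let $\{c_n\}$ be the sequence

$$c_n = \frac{n q_n^2}{p_n^2} = \frac{n d_n}{n - d_n - 1}, \tag{8}$$

and define the sequence of functions $f_n$ by

$$f_n(t) = \log \Phi(t) - \frac{(t + B/p_n)^2}{2 c_n}, \tag{9}$$

where, in usual notation, $\Phi(t) = \int_{-\infty}^{t} \varphi(s)\,ds$ is the normal distribution function, and
$\varphi(s) = (2\pi)^{-1/2} e^{-s^2/2}$ is the standard normal density function.

Proposition 3.1

$$P(n, d_n, B) = \sqrt{\frac{n}{2\pi c_n}} \int_{-\infty}^{\infty} e^{n f_n(t)}\,dt. \tag{10}$$

Proof: We use the principle of equivalent Gaussians. Let $X^0, X^1, \ldots, X^n$ be i.i.d. $N(0, 1)$
random variables. Construct the random variables $Y_n^i$, $i = 1, \ldots, n$, by

$$Y_n^i = q_n X^0 + p_n X^i.$$

The random variables $Y_n^i$ are jointly normal, and have the same expectations and covariances
as the random variables $S_n^i$. Hence

$$\begin{aligned}
P(n, d_n, B) &= \mathbf{P}\{S_n^i \ge B,\ i = 1, \ldots, n\} = \mathbf{P}\{Y_n^i \ge B,\ i = 1, \ldots, n\} \\
&= \mathbf{P}\Bigl\{X^i \ge \frac{B}{p_n} - \frac{q_n}{p_n} X^0,\ i = 1, \ldots, n\Bigr\} \\
&= \int_{-\infty}^{\infty} \mathbf{P}\Bigl\{X^i \ge \frac{B}{p_n} - \frac{q_n}{p_n} t,\ i = 1, \ldots, n\Bigr\} \varphi(t)\,dt \\
&= \int_{-\infty}^{\infty} \Phi\Bigl(\frac{q_n}{p_n} t - \frac{B}{p_n}\Bigr)^{n} \varphi(t)\,dt.
\end{aligned}$$

The result follows from the change of variable $t \mapsto (q_n t - B)/p_n$ and the defining equations (8) and (9). ∎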
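The integral representation lends itself to a direct numerical check for small networks. The Python sketch below evaluates the right-hand side of equation (10) with the trapezoidal rule. It is a hedged illustration: the expressions $c_n = n d_n/(n - d_n - 1)$ and $p_n = \binom{n-2}{d_n}^{1/2}$ are taken as assumptions implied by the covariance structure of the $S_n^i$, and the function name `stable_probability`, the integration window, and the step count are choices made here.

```python
import math


def Phi(t):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def stable_probability(n, d, B=0.0, steps=20000):
    """Trapezoidal estimate of P(n, d, B) from the integral form (10).

    Assumes c_n = n d / (n - d - 1) and p_n = sqrt(C(n-2, d)).
    """
    if not 1 <= d < n - 1:
        raise ValueError("need 1 <= d < n - 1")
    c = n * d / (n - d - 1)
    p = math.sqrt(math.comb(n - 2, d))
    shift = B / p
    # The Gaussian factor in exp(n f_n) is centred at -shift with width sqrt(c/n).
    half_width = 12.0 * math.sqrt(c / n)
    lo, hi = -shift - half_width, -shift + half_width
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        t = lo + k * h
        cdf = Phi(t)
        if cdf <= 0.0:            # underflow far in the left tail: contributes nothing
            continue
        f = math.log(cdf) - (t + shift) ** 2 / (2.0 * c)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(n * f)
    return math.sqrt(n / (2.0 * math.pi * c)) * total * h


if __name__ == "__main__":
    # Sanity case n = 3, d = 1, B = 0: the integral reduces to that of Phi^3 * phi, i.e. 1/4.
    print(stable_probability(3, 1))
    print(2 ** 6 * stable_probability(6, 2))   # expected number of fixed points, n = 6, d = 2
```

For $n = 3$, $d = 1$ and zero margin the integrand collapses to $\Phi(t)^3 \varphi(t)$, whose integral over the real line is $1/4$, which gives a convenient closed-form check on the sketch.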


We will estimate the expected number of $B$-stable points given by equation (10) for large $n$
using variants of Laplace’s technique to estimate the integral (cf. de Bruijn [12], for instance).
Rather careful asymptotic estimates are required, however, as the integral depends critically on the
functions $f_n$, and these depend both on $n$ and the interaction orders $d_n$.

It will be convenient to consider margins of the following form:

$$B = \beta p_n c_n^{\alpha}, \qquad \alpha \ge 0, \quad \beta \ge 0.$$

For given degrees of interaction, $d_n$, the expected number of $B$-stable states will depend solely on
the choices of the parameters $\alpha \ge 0$ and $\beta \ge 0$.




3.2 B-Stability

Define the positive function

$$\psi(t) = \frac{\varphi(t)}{\Phi(t)}.$$

In Appendix A (Lemma A.1) we show that, for every given degree $d_n$ and margin $B$, the function
$f_n(t)$ has a unique maximum at $t = t_n$ where $t_n$ satisfies

$$\psi(t_n) = \frac{t_n + B/p_n}{c_n}. \tag{11}$$

Note that $t_n$ depends implicitly on both the margin $B$ and the degree of interaction $d_n$. The
following lemma, which we prove in Appendix B, provides the sought after estimate for $F(n, d_n, B)$.


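Lemma 3.2 Let $B = \beta p_n c_n^{\alpha}$ with $0 \le \alpha < 1$ and $\beta \ge 0$. If $d_n = o(n)$, then

$$F(n, d_n, B) \sim \frac{2^n e^{n f_n(t_n)}}{\sqrt{-c_n f_n''(t_n)}}. \tag{12}$$

We are now in position to state the main theorem.

Theorem 3.3 Let $d_n = o(n)$, and consider margins of the form $B = \beta p_n \sqrt{c_n}$, with $\beta \ge 0$. Then
there are constants $\beta_1$ and $\beta_2$, with $0 < \beta_1 \le \beta_2$, such that as $n \to \infty$:

a) $F(n, d_n, B)$ increases exponentially with $n$ whenever $0 \le \beta < \beta_1$;

b) $F(n, d_n, B)$ decreases exponentially with $n$ whenever $\beta > \beta_2$.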


Proof: We consider the two cases, $\{d_n\}$ bounded and $\{d_n\}$ unbounded, separately.

Case 1. $\{d_n\}$ is bounded.

From equation (8) it is clear that $c_n \sim d_n$ is bounded. Consequently, from equations (11), (25),
and (26) it follows that in equation (12) the term $f_n(t_n)$ is bounded while the term $\sqrt{-c_n f_n''(t_n)}$ is
bounded away from zero. It is clear then that, for large $n$, the behaviour of $F(n, d_n, B)$ as $\beta$ varies is
determined entirely by the sign of the exponent in equation (12). Now, from equation (25) we have


Utn)  =  - 


2 


where $t_n$ is bounded and satisfies equation (11). It is easy to verify that if $\beta = 0$ then $1 + f_n(t_n)/\log 2$
is positive and bounded away from zero, i.e., the expected number of fixed points (0-stable states)
increases exponentially with $n$ (see Table 2 for a listing of exponents for some fixed degrees of
interaction). Now, for every $n$, $F(n, d_n, B)$ decreases monotonically as $\beta$ increases (the expected
number of $B$-stable states is a monotonically decreasing function of the margin), and an examination
of the above asymptotic estimate for $f_n(t_n)$ shows that as $\beta$ increases $f_n(t_n)$ eventually decreases
sufficiently for $1 + f_n(t_n)/\log 2$ to become negative. Recalling that $d_n$ takes values only in some
finite set, by assumption, from the above equation we can find $0 < \beta_1 \le \beta_2$ such that


(‘+s) 


_^_£L  +  log - ^2 - 

2  V^(fn+01V^). 

_  M  +  loe _ — _ 

y/3^  2  %/2jr(fn  +  02^^), 


-1,  (13) 

-1.  (14)

As $\sqrt{-c_n f_n''(t_n)}$ is bounded above and away from zero, it follows from equation (12) that for every
$\beta < \beta_1$ there is $\epsilon(\beta) > 0$ such that $F(n, d_n, B) = \Omega\bigl(2^{n\epsilon(\beta)}\bigr)$; similarly, for every $\beta > \beta_2$ we can find
$\delta(\beta) > 0$ such that $F(n, d_n, B) = O\bigl(2^{-n\delta(\beta)}\bigr)$.

Case 2. $d_n \to \infty$ such that $d_n = o(n)$.

For a choice of margin $B = \beta p_n \sqrt{c_n}$, $\beta > 0$, we have from equation (9) that


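$$f_n(t_n) = \log \Phi(t_n) - \frac{(t_n + \beta\sqrt{c_n})^2}{2 c_n} = -\sum_{k=1}^{\infty} \frac{\Phi(-t_n)^k}{k} - \frac{(t_n + \beta\sqrt{c_n})^2}{2 c_n}. \tag{15}$$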


Further,  using  equation  (24)  and  the  asymptotic  form  for  the  error  function,  we  have 

Substituting  from  equations  (15),  (24),  and  (26)  in  equation  (12)  we  then  have 


Setting $\beta_1 = \beta_2 = \sqrt{2 \log 2}$ in the theorem, it is clear that exponentially increasing behaviour obtains
when $0 < \beta < \sqrt{2 \log 2}$, while, for $\beta > \sqrt{2 \log 2}$, the expected number of $B$-stable states decreases
exponentially. To complete the proof we need to show that $F(n, d_n, B)$ increases exponentially with
$n$ when $\beta = 0$. But this follows immediately because the expected number of $B$-stable states is a
monotonically decreasing function of the margin. ∎


For $d_n = d =$ constant, and margin $B = \beta p_n \sqrt{c_n}$, the critical quantities $\beta_1 = \beta_2 = \beta^*$ of
equations (13) and (14) can be precisely calculated. The critical values $\beta^*$ are listed in Table 1
for a range of fixed interaction orders. Note that the critical values $\beta^*$ appear to increase to a
maximum around $d = 25$, and then decrease monotonically.


       d      β*
       1    0.0690
       2    0.1214
       3    0.1557
       4    0.1792
       5    0.1960
      10    0.2349
      25    0.2476
      50    0.2316
     100    0.2023
    1000    0.0959

Table 1: Critical values of margin, β*, for various choices of fixed degree, d.

       d      k_d       ω_d
       1     1.0505    0.2874
       2     1.1320    0.4265
       3     1.2178    0.5124
       4     1.3031    0.5721
       5     1.3868    0.6165
      10     1.7784    0.7382
      25     2.7867    0.8541
      50     4.2207    0.9104
     100     6.7176    0.9466
    1000    39.3421    0.9917

Table 2: The behaviour of the expected number of fixed points, F(n, d, 0) ~ k_d 2^{n ω_d}, for
different values of fixed degree of interaction, d.

More explicit results can be deduced from Lemma 3.2. In the range where the expected number
of $B$-stable points increases exponentially, the multiplying coefficients and exponents can themselves
be precisely calculated given the interaction orders $d_n$. Particular cases of importance result when
$\{d_n\}$ converges to some $d > 0$, and in particular, the case $d_n = d =$ constant, and the case $d_n \to \infty$
monotonically.

Consider the case where $d_n = d =$ constant. Let $\alpha \ge 0$ and $\beta \ge 0$ specify the margin $B$, and
let $s$ be the unique solution of the equation

$$\psi(s) = \frac{s + \beta d^{\alpha}}{d}.$$

The location, $t_n$, of the maximum of $f_n$ (satisfying equation (11)) can be approximated up to terms
of the order of $n^{-1}$ by

$$t_n = s + \frac{\kappa}{n} + O\Bigl(\frac{1}{n^2}\Bigr),$$

where $\kappa$ is independent of $n$ and satisfies

$$\kappa = \frac{d(d+1)\bigl[s + (1 - \alpha)\beta d^{\alpha}\bigr]}{d + (s + \beta d^{\alpha})\bigl[s(d+1) + \beta d^{\alpha}\bigr]}. \tag{16}$$


Using the above approximation for $t_n$ in Lemma 3.2, and collecting all terms up to the order of
$n^{-1}$ in the exponent in equation (12) yields the following result.

Corollary 3.4 If $d_n = d > 0$, and $B = \beta p_n c_n^{\alpha}$ with $\alpha \ge 0$ and $\beta \ge 0$, then, as $n \to \infty$, $F(n, d_n, B) \sim
k_d 2^{n\omega_d}$ where the multiplying coefficient, $k_d$, and exponent, $\omega_d$, are independent of $n$ (and depend
solely on the interaction order, $d$, and the margin parameters, $\alpha$ and $\beta$); specifically,


=  1  - 


d  + 


2dlog2 


2s0d°  0^<P° 

d+1  d  + 


and  kj  can  be  expressed  in  the  form  C  e^  where 


C  = 


v/2x(s  +  0d°)\ 


•  1/2 

y 


s2(d  +  1)  +  S0do{d  +  2)  +  0^(p°  +  d 


and  with  k  as  in  equation  (16), 


-  2s{k  -  0dr{\  -  a)}  +  0'^d^°{l  -  2a)  +  2d] 
a0<P{d  +1)  K 


2d 

-  ^^d“-‘  - 


s  +  /?d®  s  +  0d°‘' 

An important special case results when we choose the margin to be identically zero.

Corollary 3.5 If $d_n = d > 0$, the expected number of fixed points is asymptotically $\sim k_d 2^{n\omega_d}$ where

$$\omega_d = 1 - \frac{1}{\log 2}\left[\frac{(d+1)s^2}{2d} + \log\left(\frac{s\sqrt{2\pi}}{d}\right)\right], \qquad
k_d = \left[\frac{d}{d + s^2(d+1)}\right]^{1/2} \exp\left\{\frac{(d+1)s^2}{2d}\right\},$$

and $s$ is the unique solution of $\psi(s) = s/d$.

Table 2 lists the exponent $\omega_d$ and the multiplying coefficient $k_d$ for various choices of fixed interaction
order $d$ with a choice of zero margin.
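The zero-margin quantities can be recomputed numerically. The Python sketch below is offered under stated assumptions rather than as the authors' procedure: it takes $s$ to be the root of $\psi(s) = s/d$ with $\psi = \varphi/\Phi$, the exponent to be $\omega_d = 1 + [\log\Phi(s) - s^2/(2d)]/\log 2$, and the coefficient to be $k_d = [d/(d + s^2(d+1))]^{1/2}\exp\{(d+1)s^2/(2d)\}$; the function names and the bisection solver are choices made here.

```python
import math


def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def solve_s(d):
    """Root of psi(s) = s/d on s > 0 by bisection (psi = phi/Phi is decreasing)."""
    def g(s):
        return phi(s) / Phi(s) - s / d
    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:            # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def omega(d):
    """Exponent omega_d for zero margin (assumed form, see lead-in)."""
    s = solve_s(d)
    return 1.0 + (math.log(Phi(s)) - s * s / (2.0 * d)) / math.log(2.0)


def coefficient(d):
    """Multiplying coefficient k_d for zero margin (assumed form, see lead-in)."""
    s = solve_s(d)
    return math.sqrt(d / (d + s * s * (d + 1))) * math.exp((d + 1) * s * s / (2.0 * d))


if __name__ == "__main__":
    for d in (1, 2, 3, 4, 5, 10, 25, 50, 100, 1000):
        print(f"{d:5d}  k_d = {coefficient(d):8.4f}  omega_d = {omega(d):.4f}")
```

The two expressions for the exponent agree because $\Phi(s) = \varphi(s)/\psi(s) = d\,\varphi(s)/s$ at the root, so $-\log\Phi(s) = s^2/2 + \log(s\sqrt{2\pi}/d)$.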

The  monotonicity  of  ^  yields 

Corollary 3.6 Let $B = \beta p_n c_n^{\alpha}$ be the margin. If $d_n \to \infty$ such that $d_n = o(n)$ then:

a) the expected number of $B$-stable states increases exponentially with $n$ if $0 \le \alpha < 1/2$ and
$\beta \ge 0$;

b) the expected number of $B$-stable states asymptotically tends to zero as $n \to \infty$ if $\alpha > 1/2$
and $\beta > 0$.

Note from Table 2 that as $d$ becomes large $F(n, d_n, B)$ approaches $2^n$. This is supported by the
following result which gives the number of fixed points (0-stable states) when the interaction orders
are allowed to grow large.

Corollary 3.7 If as $n \to \infty$, for any fixed choice of $\tau$ with $0 < \tau < 1$, $\{d_n\}$ satisfies $d_n =
\Omega[n/(\log n)^{\tau}]$, and $d_n = o(n)$, then the expected number of fixed points (zero margin) is given by
$F(n, d_n, 0) \sim k_{d_n} 2^{n\omega_{d_n}}$ as $n \to \infty$, where

$$k_{d_n} = \frac{d_n}{2\sqrt{2\pi}\,\log d_n},$$

$$\omega_{d_n} = 1 - \frac{\log d_n}{d_n \log 2} + \frac{\log\log d_n}{2 d_n \log 2} + \frac{\log(\sqrt{4\pi}/e)}{d_n \log 2}.$$

Proof: Consider the exponent in the integrand of equation (10). We have

$$n f_n(t_n) = n \log \Phi(t_n) - \frac{n t_n^2}{2 c_n} = -n \sum_{k=1}^{\infty} \frac{\Phi(-t_n)^k}{k} - \frac{n t_n^2}{2 c_n}.$$

Using the asymptotic formula for the tails of the normal distribution (cf. Feller’s text [13], for
instance) together with Lemma A.3 (equation (23)), and equations (8) and (26) in equation (12)
completes the proof. ∎

Note that for this case, the multiplying coefficient $k_{d_n}$ and exponent $\omega_{d_n}$ assume particularly
simple closed form expressions depending solely on the interaction order $d_n$. Note also that $\omega_{d_n} \to 1$
as $n \to \infty$, as is expected. The growth of $d_n$ with $n$ is rather rapid in Corollary 3.7. Results akin to
Corollary 3.7 can be computed for slower rates of growth of $d_n$ (for instance, $d_n = n^{\alpha}$, $0 < \alpha < 1$).
We do not yet have rigorous results, however, for the case where $d_n$ scales linearly with $n$.

4  NON-HOMOGENEOUS  NETWORKS 

4.1  Higher  Order  Spin  Glasses 

The non-homogeneous case has several more degrees of interconnection freedom. The results of the
last section can, however, be simply extended to this case.

Analogously with equations (7) and (8) let

$$\bar{p}_n = \left[\sum_{k=1}^{d_n} \binom{n-2}{k}\right]^{1/2}$$

and

$$\bar{c}_n = n\,\frac{\sum_{k=1}^{d_n} \binom{n-2}{k-1}}{\sum_{k=1}^{d_n} \binom{n-2}{k}},$$

and to every choice of margin (fixed $\beta \ge 0$ and $\alpha \ge 0$) $B = \beta p_n c_n^{\alpha}$ in the homogeneous case
associate a margin $\bar{B} = \beta \bar{p}_n \bar{c}_n^{\alpha}$ in the non-homogeneous case. Define the sequence of functions $\bar{f}_n$
(corresponding to equation (9)) by

$$\bar{f}_n(t) = \log \Phi(t) - \frac{(t + \bar{B}/\bar{p}_n)^2}{2 \bar{c}_n}. \tag{17}$$

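Let $\bar{F}(n, d_n, \bar{B})$ denote the expected number of $\bar{B}$-stable states of a non-homogeneous algebraic
threshold network of degree $d_n$ with Gaussian interactions.

Proposition 4.1

$$\bar{F}(n, d_n, \bar{B}) = 2^n \sqrt{\frac{n}{2\pi \bar{c}_n}} \int_{-\infty}^{\infty} e^{n \bar{f}_n(t)}\,dt. \tag{18}$$

Proof: For $i = 1, \ldots, n$ set

$$\bar{S}_n^i = \sum_{j=1}^{d_n} \sum_{I \in \mathcal{I}_j :\, i \in I} w_I.$$

Noting that $\bar{F}(n, d_n, \bar{B}) = 2^n\,\mathbf{P}\{\bar{S}_n^i \ge \bar{B},\ i = 1, \ldots, n\}$, the proof follows the same outline as that for
Proposition 3.1. ∎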


Theorem  4.2  //d„  =  o(n)  then  ~  as  n  oo. 

Proof:  We  use  the  following  inequality  due  to  Blake  and  Darabian  [14].  Set  r  =  d/(n  —  d  +  1). 
Then  .  . 

^  1  i  J )  < 

1  -  r  d(n  -  d+ 1)(1  -  r)2y  (”)  ^ 

For $d_n = o(n)$ we hence have

The analysis in Theorem 3.3 now continues to hold in toto. ∎


So far we have considered relatively small interaction orders, $d_n = o(n)$. A theoretically
important case results when $d_n$ is allowed to grow linearly with $n$. In fact, as $d_n$ approaches $n$,
almost all dichotomies of $2^n$ points in binary $n$-space can be separated by a non-homogeneous
network (Venkatesh and Baldi [2]). It is useful, hence, to estimate the number of fixed points,
$\bar{F}(n, d_n, 0)$, for the random case when $d_n$ grows linearly with $n$.

Theorem 4.3 If $d_n \sim n/2$ then $\bar{F}(n, d_n, 0) \sim 2^n/(n+1)$ as $n \to \infty$.


Proof:  If  d„  ~  n/2  then  in  =  n[l  ±  0{\ly/n)\.  Hence  from  equations  (17)  and  (18), 


n  +  1 


/  dt  < 

J^nT 


where  u„  =  fl(y^).  Fix  0  <  r  <  1/4.  Then 

[$(nT)n+i  _  $(-n^)"+i]e-""V«^"  1 

n  +  1  V2x 

while 

0  <  ‘  /  it  <  n_:*(;£ri±2t£ri.] . 

~  J\t\>nr  ^  ^  ~  V  n  +  1  J 

Now  $(n’')"'*'*  — ►  1,  ^(-n’')"'*’^  -*  0,  and  n^^fvn  -*  0  as  n  -*  cx>.  Hence 

d;  4- 

The  proof  is  completed  by  substitution  in  equation  (19). 
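The factor $1/(n+1)$ in Theorem 4.3 can be traced to an elementary identity: since $\frac{d}{dt}\Phi(t)^{n+1} = (n+1)\Phi(t)^n\varphi(t)$, the integral of $\Phi^n\varphi$ over $[-T, T]$ equals $[\Phi(T)^{n+1} - \Phi(-T)^{n+1}]/(n+1)$, which has the same shape as the bracketed term with $T = n^{\tau}$ in the proof above. The short Python sketch below checks this identity numerically; the function names and the trapezoidal rule are choices made here, and the link to the garbled intermediate steps of the proof is our reading, not a quotation.

```python
import math


def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def truncated_integral(n, T, steps=20000):
    """Trapezoidal estimate of the integral of Phi(t)^n * phi(t) over [-T, T]."""
    h = 2.0 * T / steps
    total = 0.0
    for k in range(steps + 1):
        t = -T + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * Phi(t) ** n * phi(t)
    return total * h


def closed_form(n, T):
    """[Phi(T)^(n+1) - Phi(-T)^(n+1)] / (n + 1)."""
    return (Phi(T) ** (n + 1) - Phi(-T) ** (n + 1)) / (n + 1)


if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, truncated_integral(n, 3.0), closed_form(n, 3.0), 1.0 / (n + 1))
```

As $T$ grows the closed form tends to $1/(n+1)$, the probability that one of $n + 1$ exchangeable continuous random variables is the smallest.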


4.2  Random  Energy  Model 

The dynamics of the symmetric interaction systems considered above are characterised by
Hamiltonians or energies. The determination of the number of fixed points of such a system is hence
equivalent to counting the number of states which form (local) energy minima. For higher order spin
glasses, the energy of each state given by equation (3) is an $N\bigl(0, \binom{n}{d_n + 1}\bigr)$ random variable.
The energies of different states are dependent, identically distributed normal random variables.

The random energy model (cf. Derrida [11]) is an allied system which assigns energies as i.i.d.
$N(0, 1)$ random variables to the vertices of the hypercube. State energies are now independent
normal random variables. Such an assignment of state energies results in random acyclic
orientations of the vertices of the hypercube (cf. Baldi [15]) defined by state transitions: $u \mapsto v$ iff
$H(u) > H(v)$. For any given assignment of energies, the corresponding acyclic orientation can be
realised by a (non-homogeneous) threshold network with degree $d_n = n$. In particular, we have $2^n$
interaction coefficients for such a system so that any particular assignment of $2^n$ state energies can
be realised for a particular choice of coefficients.

Let $G_n$ be the number of local energy minima corresponding to a random acyclic orientation.

Theorem 4.4

$$\mathbf{E}(G_n) = \frac{2^n}{n+1}, \qquad \mathrm{Var}(G_n) = \frac{(n-1)\,2^{n-1}}{(n+1)^2}.$$

Proof: Let $s^i$, $i = 1, \ldots, 2^n$, enumerate the vertices of the hypercube. For $i = 1, \ldots, 2^n$, let the
random variable $I^i$ be the indicator for state $s^i$, i.e.,

$$I^i = \begin{cases} 1 & \text{if } s^i \text{ is an energy minimum,} \\ 0 & \text{otherwise.} \end{cases}$$

We then have $G_n = \sum_{i=1}^{2^n} I^i$. The probability, $p = \mathbf{P}\{I^i = 1\}$, that any particular state is a local
minimum is just the probability that it is assigned a lower energy value than any of its $n$ nearest
neighbours (at Hamming distance one from it). As the assigned energies are i.i.d. random variables,
we have $p = 1/(n+1)$. Hence the expected number of fixed points is $2^n/(n+1)$. We now compute
the joint probability that two states $s^i$ and $s^j$ are energy minima. Let $d_{ij}$ represent the Hamming
distance between $s^i$ and $s^j$. It is easy to see that

$$\mathbf{P}\{I^i I^j = 1\} = \begin{cases}
1/(n+1)^2 & \text{if } d_{ij} > 2, \\
1/[n(n+1)] & \text{if } d_{ij} = 2, \\
0 & \text{if } d_{ij} = 1, \\
1/(n+1) & \text{if } d_{ij} = 0.
\end{cases}$$

Now, we have

$$\begin{aligned}
\mathrm{Var}(G_n) &= \mathbf{E}(G_n^2) - (\mathbf{E}\,G_n)^2 = \sum_{i=1}^{2^n} \sum_{j=1}^{2^n} \mathbf{P}\{I^i I^j = 1\} - \Bigl(\sum_{i=1}^{2^n} \mathbf{P}\{I^i = 1\}\Bigr)^2 \\
&= \frac{2^n}{n+1} + 2 \sum_{\substack{i < j \\ d_{ij} = 2}} \frac{1}{n(n+1)} + 2 \sum_{\substack{i < j \\ d_{ij} > 2}} \frac{1}{(n+1)^2} - \Bigl(\frac{2^n}{n+1}\Bigr)^2.
\end{aligned}$$

Collecting terms and simplifying yields the final result. ∎
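For very small $n$ the two moments can be verified exactly. Because the energies are i.i.d. continuous random variables, every ordering of the $2^n$ states is equally likely, so it suffices to enumerate all rankings. The Python sketch below does this with exact rational arithmetic; the function name `exact_moments` and the bit-mask encoding of hypercube vertices are choices made here, and the enumeration is feasible only for $n \le 3$ or so.

```python
import itertools
from fractions import Fraction


def exact_moments(n):
    """Exact mean and variance of the number of local minima on the n-cube.

    Enumerates all (2^n)! equally likely rankings of the state energies.
    """
    size = 2 ** n
    # Neighbours of vertex v: flip one of its n bits.
    neighbours = [[v ^ (1 << b) for b in range(n)] for v in range(size)]
    total = 0
    total_sq = 0
    count = 0
    for ranks in itertools.permutations(range(size)):
        g = sum(1 for v in range(size)
                if all(ranks[v] < ranks[w] for w in neighbours[v]))
        total += g
        total_sq += g * g
        count += 1
    mean = Fraction(total, count)
    var = Fraction(total_sq, count) - mean * mean
    return mean, var


if __name__ == "__main__":
    mean, var = exact_moments(2)
    print("n = 2:", mean, var)
```

The values returned can be compared with $2^n/(n+1)$ and $(n-1)2^{n-1}/(n+1)^2$; for $n = 3$ the enumeration covers $8! = 40320$ rankings.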

Note that the result of Theorem 4.4 provides anecdotal support for the result of Theorem 4.3
as a sort of limiting result. Stronger results can be shown for the random energy model: the number
of fixed points, $G_n$, exhibits central tendency. Let $G_n^*$ denote the normalised r.v.

$$G_n^* = \frac{G_n - \mathbf{E}\,G_n}{\sqrt{\mathrm{Var}(G_n)}}.$$

Theorem  4.5  There  is  an  absolute  positive  constant  C  such  that  for  every  x 

|P  (G;  <  x)  -  $(z)|  <  C2-° 

We refer the reader to the papers by Baldi, et al. [16, 17] for a proof of the theorem. It is an
open question whether the number of fixed points of higher order spin glasses also exhibits central
tendency.

5 CONCLUSIONS

We have rigorously estimated the expected number of stable points of higher order spin glasses with
generalised Gaussian interactions. The critical feature observed here is the threshold phenomenon
that is evidenced in the expected number of fixed points around a range of degree dependent critical
margins. For margins below the critical range we have shown a precise exponentially increasing
form of the solution, while for margins greater than the critical range we have shown that the
expected number of fixed points decreases exponentially with the number of nodes in the network.
Open questions remain on a more precise determination (than the mean) of the number of fixed
points (as a function of the margin), and in particular, on whether there is central tendency as in
the random energy model.

The results of this paper appear to have relevance to the programmed situation where
interaction strengths are to be chosen for which specified collections of binary $n$-tuples are fixed points with
some desired radius of attraction. In such cases it is important to be cognisant of the number of
extraneous fixed points, and their radii of attraction, that are developed incidentally. Rigorous
results have, however, been shown only for the linear interaction case ($d = 1$) with interactions
programmed by the outer product algorithm (Komlós and Paturi [9]). The analysis appears to be
substantially harder for higher order cases, even for the relatively simple outer product algorithm
(cf. Newman’s earlier paper [4] and our two concurrent papers [2, 3] for illustrations of the
difficulties caused by the more severe statistical dependences in higher order cases). The extraneous fixed
point structure of other algorithms, such as the spectral algorithm (Venkatesh and Psaltis [10]), is
even less understood, especially in the higher order versions. It is not readily apparent whether
the results derived here for the case of random, symmetric interactions (especially Theorem 3.3
and the corollaries) can be utilised in a rigorous analysis in programmed cases; nonetheless, these
results may provide qualitative indications of behaviours that may be expected in programmed
cases, especially when the dependences are weak.

A Properties of $f_n$

Lemma A.1 For each $n$ (and any choice of margin, $B \ge 0$, and degree, $d_n$):

a) $f_n$ is a convex ∩, strictly negative function with a unique maximum at $t = t_n$;

b) for $t > 1$, $f_n''$ increases monotonically to $-1/c_n$ as $t \to \infty$.


Proof: First we claim that $\psi$ is a positive, monotone decreasing function. Clearly $\psi(t) > 0$ for
all $t$. Consider $\psi'(t) = -t\psi(t) - \psi(t)^2$. Clearly $\psi'(t) < 0$ for $t \ge 0$. Now, for $t > 0$ consider

$$\psi(-t) = \frac{\varphi(-t)}{\Phi(-t)} > \frac{\varphi(-t)}{t^{-1}\varphi(-t)} = t, \qquad t > 0.$$

Hence $\psi'(t) < 0$ for all $t$ so that $\psi$ is monotone decreasing. By repeated differentiation of
equation (9) we have

$$f_n'(t) = \psi(t) - \frac{t + B/p_n}{c_n}, \tag{20}$$

$$f_n''(t) = -t\psi(t) - \psi(t)^2 - \frac{1}{c_n}. \tag{21}$$

Now $f_n''(t) = \psi'(t) - 1/c_n < 0$ for all $t$ so that $f_n$ is strictly convex ∩, while the monotonicity of
$\psi$ guarantees a unique solution, $t = t_n$, of $f_n'(t) = 0$. As $f_n(t) < 0$ for all $t$ by inspection of
equation (9), part (a) follows. Now note that

$$[t\psi(t)]' = -\psi(t)\bigl[t^2 + t\psi(t) - 1\bigr] < 0 \qquad (t > 1).$$

Hence both $\psi(t)$ and $t\psi(t)$ decrease monotonically to zero so that (b) follows. ∎

Lemma  A. 2  For  each  n,  /„  has  derivatives  of  all  orders,  and  in  fact,  for  k  >  3,  the  derivatives 
are  independent  of  n  and  have  the  representation 

1*/2J  k-V 

/i‘>(0  =  E  E  (22) 

1=0  m=l 

where  the  coefficients  c|^  are  real  constants  independent  of  n,  and  Cq*j  =  (  —  1)*'“*. 

Proof: Note that for $k \ge 3$ we have $f_n^{(k)}(t) = \psi^{(k-1)}(t)$. The result follows by induction. ∎

Lemma  A.3  Let  B  =  PpnC^  with  a  >  0  and  /?  >  0,  and  let  /„  achieve  its  maximum  at  t„.  Then 
as  n  —f  oo; 

a)  if  dn  -*  d,  then  <„  — ►  s  where  s  satisfies 

0(s)-^  = 

b)  if  dn  — ►  oo,  and  a  =  0  or  0  =  0,  then 

tn  =  [21og  Cn  -  loglogCn  -  log(4T)  O  ;  (23) 

c)  i/d„  -»  00,  0  <  a  <  1,  and  0  >  0,  then 

tn  =  2(1  -  a)logc„  -  21og(/?>/2T)  -  O 


(24)

Proof: Part (a) follows by continuity of $\psi$, as $c_n \sim d_n$ $(n \to \infty)$ from equation (8). Parts (b)
and (c) can be verified by direct substitution. ∎


Remarks: 


/;(<n)  =  0, 

~  k>3,  <1,-  00.  (27) 

Cn 


Note that for $k \ge 3$, $f_n^{(k)}(t)$ does not depend on n any more so that uniform bounds can be
obtained.


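We will seek to approximate the functions $f_n(t)$ in terms of a Taylor series
expansion. Specifically, for particular choices of $\epsilon_n > 0$ and $\delta_n > 0$ we use

$$|t - t_n| < \delta_n \;\Longrightarrow\; \left| f_n(t) - f_n(t_n) - f_n''(t_n)\,\frac{(t - t_n)^2}{2} \right| < \epsilon_n\,\frac{(t - t_n)^2}{2}. \tag{28}$$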


The  next  two  lemmas  outline  conditions  under  which  the  above  holds. 


Lemma  A.4  If  the  sequence  {d,,}  is  bounded  then  for  any  specification  of  margin,  B  =  Ppn^  ^ith 
$\alpha \ge 0$ and $\beta \ge 0$, and for any $\epsilon > 0$, we can find $\delta > 0$ uniform with respect to n such that
equation (28) holds with a choice of $\epsilon_n = \epsilon$ and $\delta_n = \delta$.


Proof: Set $g_n(t) = f_n(t) - f_n(t_n) - f_n''(t_n)(t - t_n)^2/2$. We have $g_n(t_n) = g_n'(t_n) = g_n''(t_n) = 0$.

Applying the Mean Value Theorem, we can find $0 < a < 1$ such that $g_n(t_n + \zeta) = \zeta\, g_n'(t_n + a\zeta)$,
while $g_n'(t_n + a\zeta) = o(a\zeta)$ as $\zeta \to 0$. Hence $g_n(t_n + \zeta) = o(\zeta^2)$, $(\zeta \to 0)$. Thus, for each n, and
every $\epsilon > 0$, we can find $\delta_n > 0$ such that $|g_n(t)| < \epsilon (t - t_n)^2/2$ whenever $|t - t_n| < \delta_n$.

Now  assume  without  loss  of  generality  that  dn  takes  values  from  the  finite  set  {p^,. . .  ,p^}. 
For  i  =  1, ...  ,K,  set 


m 

gV) 


(t+0(prf 

2p' 


log$(<)  - 


•A2 


m-nn-nv) 


(t  - 1') 


where  /'  has  its  maximum  at  t'.  Then  for  every  e  >  0  there  exists  ^'  >  0  such  that  |5'(t)l  < 
c(t  —  t*)^/4  whenever  |t  -  l‘|  <  S'.  Now  c„  =  d„  +  0(l/n)  so  that  from  equation  (11)  it  follows 
that  t„  =  t'  +  0(l/n)  for  some  i  €  K).  (As  0  is  monotone  decreasing,  we  have: 


tn-i'<x 


■  a -i) /&■'<>&■ 
As $d_n$ is bounded we are guaranteed that $t^i$ is also bounded, so that the result follows.] From
equations  (9),  (25),  and  (26)  it  hence  follows  that 

ip«(‘)i  =  ij'(oi  +  0  (^)  < '  +  0  (i)  (i<  -  (ni  <  n 

We can hence choose N such that for $n > N$, we have $|g_n(t)| < \epsilon (t - t_n)^2/2$ whenever
$|t - t_n| < \min\{\delta^1, \ldots, \delta^K\}$. We can finally choose a smallest $\delta = \min\{\delta_1, \ldots, \delta_N, \delta^1, \ldots, \delta^K\}$ to establish
uniformity. ∎


Lemma  A. 5  Let  B  =  PpnC^  he  the  margin.  If  dn  -*  oo  as  n  —>■  oo,  then  equation  (28)  holds  for  n 
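large enough for the following choices of $\epsilon_n$ and $\delta_n$:

a) $\epsilon_n = A \log c_n / c_n$ and $\delta_n = A/\sqrt{128 \log c_n}$ if $\alpha = 0$ or $\beta = 0$;

b) $\epsilon_n = A\sqrt{\log c_n}/c_n^{1-\alpha}$ and $\delta_n = A/(8\beta(1 - \alpha)\sqrt{\log c_n})$ if $0 < \alpha < 1$ and $\beta > 0$.

Here $A > 0$ is a suitably small (but fixed) choice of parameter.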

Proof: We will prove the result for the case {$\alpha = 0$ or $\beta = 0$}; the proof for the case {$0 < \alpha < 1$
and $\beta > 0$} is similar.

Consider  a  choice  of  maj^n  B  =  PpnCn  with  a  =  0  or  =  0.  Set  e„  =  A  log  c„/cn  and 
$\delta_n = A/\sqrt{128 \log c_n}$ for some $A > 0$ to be specified suitably small. In the proof of Lemma A.4 set
$\epsilon = \epsilon_n$. Now, it suffices to show that $|g_n'(t_n + \zeta)| < \epsilon_n |\zeta|/2$ whenever $|\zeta| < \delta_n = A/\sqrt{128 \log c_n}$,
$(n \to \infty)$. We have

$$|g_n'(t_n + \zeta)| = |f_n'(t_n + \zeta) - f_n''(t_n)\zeta|.$$

By the Mean Value Theorem, there exists $0 < \beta < 1$ such that

$$|f_n'(t_n + \zeta)| = |\zeta|\,|f_n''(t_n + \beta\zeta)|.$$

Now consider

$$|f_n'(t_n - \delta_n) + f_n''(t_n)\delta_n| = \delta_n\,|-f_n''(t_n - \beta\delta_n) + f_n''(t_n)| \le \delta_n\,[-f_n''(t_n - \delta_n) + f_n''(t_n)].$$

The last inequality follows from Lemma A.1(b) as $f_n''$ is negative and increases monotonically to
$-1/c_n$ for large t, and by Lemma A.3(b) which ensures that $t_n \sim \sqrt{2 \log d_n} \to \infty$, $(n \to \infty)$. Using
equation (21) with Lemma A.3(b), as $n \to \infty$ we have

|/;(<n  -  6n)  +  f'MSnl  <  K  |^(<n  -  Snf  +  (<„  -  h„)V'(f„  -  ^n)  -  V'(fO"  "  <nV’(tn)l 

<  LtnSn-Sl  _  1 1  +  0  f - . 

We  have  =  0(A)  so  that  for  A  sufficiently  small 

y/26ntn  /j] 


l/;(<n-^n)  +  /"(«n)hn|  /S 


~  €fi 


6n 


(n  ^  oo). 


Similarly 

Mu  -  e.)  -  K(u)(n\  i  f»  j  (n  00). 

By Lemma A.1, the above inequalities hold in the $\delta_n$-neighbourhood of $t_n$, and this completes the
proof. ∎

B The Main Lemma

We  prove  here  Lemma  3.2,  restated  below  for  convenience. 

Lemma  3.2  Let  B  =  I3pn<^  vrith  0  <  a  <  1  and  >  0.  If  dn  =  o(n),  then 

Proof: We will consider separately the cases where $\{d_n\}$ is bounded and $\{d_n\}$ is unbounded.
Case 1. $\{d_n\}$ is bounded.

The  sequence  /”(<«)  is  bounded  strictly  away  from  both  zero  and  infinity.  Hence  set  ^  = 
inf  |/"(<n)|  >  0  and  k  =  sup|/"(<„)|  <  oo.  Fix  c  arbitrarily  in  the  open  interval  (0,0-  Choose 
^  >  0  uniform  with  respect  to  n  by  Lemma  A.4.  By  Proposition  3.1  we  have 


2^y/n  '  "■  J\t-tn\<S  J\t-tn\>S 


Let  1?  be  a  parameter,  |t?|  <  €.  Consider 


-2x 


-2>r  r'-*  ^ 

-  U(/"(tn) + t?)J 


By  Lemma  A.4  it  then  follows  that 


r  -2t  1*/^  ,  /•*»+*  ,,, 

^  ln(/n(^n)  +  f)J  y  V"  j 

where  q{6)  >  0  depends  solely  on  6.  As  (  was  arbitrary  we  have 

/  dt  ~  (n  ^  oo). 

Jt„-S  Vn/"(t„)y 

Let $\zeta = \sup|f_n(t_n)|$. The sequence $\{f_n(t_n)\}$ is bounded so that $\zeta < \infty$. For each n set

$$h_n(\delta) = \max\{f_n(t_n - \delta) - f_n(t_n),\; f_n(t_n + \delta) - f_n(t_n)\} < 0.$$


Then 


Ut 


f  g"/n(0  _  g»»/n(<n)  f  g*>(/n(0“/"(*n))  , 

J\t-tn\>S  J\t-t„\>S 

<  g"/n(*n)  /g("-i)fc«(S)~/i«(*")  f  i(i)e~^^ dtl 

I  J  ■ 


(30) 


(31)

Now let $h(\delta) = \sup_n h_n(\delta)$. As $\{d_n\}$ is bounded, Lemma A.1 ensures that $h(\delta) < 0$ strictly.
Furthermore, for each n, $0 > h_n(\delta) \ge f_n(t) - f_n(t_n)$ whenever $|t - t_n| \ge \delta$, by Lemma A.1. Hence

/  dt  <  (\/2Tc„e("-»)M«)+cl . 

^|t-<n|>5  I  J 

Hence,  there  exists  7(j),  p(S)  >  0  such  that 
fit  I 

~  - «  <»  -  <»). 

so  that  equation  (12)  follows. 

Case 2. $d_n \to \infty$ such that $d_n = o(n)$ as $n \to \infty$.

We prove the result for a choice of margin with $\alpha = 0$ or $\beta = 0$. Fix $A > 0$ and choose
$\epsilon_n = A \log c_n / c_n$, and $\delta_n = A/\sqrt{128 \log c_n}$. (Note that $c_n \sim d_n$ from equation (8).) Now, from
equations (23) and (26), for $|\vartheta| < \epsilon_n$ and for small A, as $n \to \infty$,

-(/».) +  '>)=^^|l±0(  A)). 

Cn 

As $n/c_n \sim n/d_n \to \infty$, $(n \to \infty)$, the first term on the right hand side of equation (29) dominates
the second, so that for a sufficiently small choice of A, equation (30) continues to hold:

r^n+^fi  -  /  ,  t  !  \  f  — 2ir  \ 

L.‘  'i^) 

Now  by  Taylor’s  formula  we  have 

/„(<n  ±  «n)  -  MU)  =  +  I  £■“■((  -  UfCWit. 

By  Lemma  A.2  and  equations  (23)  and  (27)  we  have 


5/  it-tn?fAt)dt  <  ^  sup  |C{0l  ^  (1  +  0(1)) 

2  Jtn  2  2V2k 


(1  +  0(1)). 


1024c„  '  ■  ' 

Substituting  from  equation  (25)  we  then  have  for  a  small  enough  choice  of  A  that 

MU  ±  «n)  -  MU)  =  (1  -  0(A)|11  +  0(1))  (n  -  oo). 

Substituting  in  equation  (31)  we  have  as  n  — >  oo 


Noting that $\zeta = \sup|f_n(t_n)|$ is finite, and that $n/c_n \sim n/d_n \to \infty$ as $n \to \infty$, equation (12) follows
from equations (32) and (33) by choosing A suitably small.

The  proof  for  a  choice  of  margin  B  =  /JpnCj  0  <  a  <  1  and  ^  >  0  is  similar  (with 
equation (24) giving the asymptotic form for $t_n$ in this case). ∎



References 

[1] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational
abilities,” Proc. Natl. Acad. Sci. USA, vol. 79, pp. 2554-2558, 1982.

[2]  S.  S.  Venkatesh  and  P.  Baldi,  “Programmed  interactions  in  higher-order  neural  networks: 
maximal  capacity,”  Journal  of  Complexity,  to  appear. 

[3]  S.  S.  Venkatesh  and  P.  Baldi,  “Programmed  interactions  in  higher-order  neural  networks:  the 
outer-product  algorithm,”  Journal  of  Complexity,  to  appear. 

[4] C. M. Newman, “Memory capacity in neural network models: rigorous lower bounds,” Neural
Networks, vol. 1, no. 3, pp. 223-238, 1988.

[5]  S.  F.  Edwards  and  F.  Tanaka,  “Analytical  theory  of  the  ground  state  properties  of  a  spin  glass: 
I.  Ising  spin  glass,”  Jnl.  Phys.  F,  vol.  10,  pp.  2769-2778,  1980. 

[6] D. J. Gross and M. Mezard, “The simplest spin glass,” Nucl. Phys., vol. B240, pp. 431-452,
1984.

[7] R. J. McEliece and E. C. Posner, “The number of stable points of an infinite-range spin glass
memory,” JPL Telecomm. and Data Acquisition Progress Report, vol. 42-83, pp. 209-215,
1985.

[8] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh, “The capacity of the
Hopfield associative memory,” IEEE Trans. Inform. Theory, vol. 33, pp. 461-482, 1987.

[9] J. Komlós and R. Paturi, “Convergence results in an associative memory model,” Neural
Networks,  vol.  1,  no.  3,  pp.  239-250,  1988. 

[10]  S.  S.  Venkatesh  and  D.  Psaltis,  “Linear  and  logarithmic  capacities  in  associative  neural  net¬ 
works,”  IEEE  Trans.  Inform.  Theory,  vol.  IT-35,  pp.  558-568,  1989. 

[11]  B.  Derrida,  “Random-energy  model:  limit  of  a  family  of  disordered  models,”  Phys.  Rev.  Lett., 
vol.  45,  pp.  79-82,  1980. 

[12]  N.  G.  de  Bruijn,  Asymptotic  Methods  in  Analysis.  New  York:  Dover,  1981. 

[13]  W.  Feller,  An  Introduction  to  Probability  Theory  and  its  Applications,  vol.  I.  New  York:  Wiley, 
1968. 

[14]  I.  F.  Blake  and  H.  Darabian,  “Approximations  for  the  probability  in  the  tails  of  the  Binomial 
distribution,”  IEEE  Trans.  Inform.  Theory,  vol.  IT-33,  pp.  426-428,  1987. 

[15]  P.  Baldi,  “Neural  networks,  orientations  of  the  hypercube,  and  algebraic  threshold  functions,” 
IEEE  Trans.  Inform.  Theory,  vol.  34,  pp.  523-530,  1988. 

[16]  P.  Baldi  and  Y.  Rinott,  “Asymptotic  normality  of  some  graph  related  statistics,”  Jnl.  Appl. 
Prob.,  vol.  26,  pp.  171-175,  1989. 

[17]  P.  Baldi,  Y.  Rinott,  and  C.  Stein,  “A  normal  approximation  for  the  number  of  local  maxima  of 
a  random  function  on  a  graph,”  in  Probability,  Statistics,  and  Mathematics:  Papers  in  Honor 
of  Samuel  Karlin,  (eds.  T.  W.  Anderson,  K.  B.  Athreya,  and  D.  L.  Iglehart).  New  York: 
Academic  Press,  1989. 


Neural Networks, Vol. 3, pp. 613-623, 1990
Printed in the USA. All rights reserved.

Copyright © 1990 Pergamon Press plc


ORIGINAL  CONTRIBUTION 

Shaping  Attraction  Basins  in  Neural  Networks 


Santosh  S.  Venkatesh  and  Girish  Pancha 

Moore School of Electrical Engineering, University of Pennsylvania

Demetri  Psaltis 


Department of Electrical Engineering, California Institute of Technology

AND  Gabriel  Sirat 


Groupe Optique des Matériaux, Ecole Nationale Supérieure des Télécommunications

(Received  19  January  1989;  revised  and  accepted  22  February  1990) 

Abstract— An  interesting  duality  between  two  formally  related  schemes  for  neural  associative  memory  is  exploited 
to  shape  the  attraction  basins  of  stored  memories.  Considered  are  a  family  of  spectral  algorithms — based  on 
specifying  the  spectrum  of  the  matrix  of  weights  as  a  function  of  the  memories  to  be  stored— and  a  class  of  dual 
spectral  algorithms— based  on  manipulations  of  the  orthogonal  subspace  of  the  memories,  which  are  expanded 
here.  These  algorithms  are  shown  to  attain  near  maximal  memory  storage  capacity  of  the  order  of  n,  and  are 
shown to typically require the order of $n^3$ elementary operations for their implementation. Signal-to-noise ratio
arguments are presented showing a duality in the error-correction behaviour of the two schemes: the spectral
algorithm demonstrates memory-specific attraction around the memories, while the dual spectral algorithm
demonstrates direction-specific attraction. Composite algorithms capable of joint memory-specific and
direction-specific attraction are presented as a means of variably shaping attraction basins around desired memories.
Computer  simulations  are  included  in  support  of  the  analysis. 

Keywords — Associative  memory.  Network  dynamics. 


1.  INTRODUCTION 

In  this  paper  we  develop  the  duality  between  two 
methods  for  training  a  fully  connected  network  of  n 
McCulloch-Pitts  neurons  (McCulloch  &  Pitts,  1943). 
The  sum  of  outer  products  is  perhaps  the  most  often 
used  training  method  for  such  networks  (Nakano, 
1972;  Amari,  1977;  Hopfield,  1982).  The  memory 
storage capacity for this method is $n/(4 \log n)$ (McEliece,
Posner, Rodemich, & Venkatesh, 1987; Psaltis
& Venkatesh, 1989) whereas the maximal theoretical
capacity  for  any  storage  algorithm  is  2n  (Cover,  1965; 
Venkatesh,  1986b).  The  spectral  algorithm  (Kohonen, 
1977;  Personnaz,  Guyon,  &  Dreyfus,  1985;  Venka¬ 
tesh  &  Psaltis,  1989)  and  an  algorithm  we  will  refer 
to as the dual spectral algorithm (Maruani, Chevallier,
& Sirat, 1987) are algorithms whose capacities
approach the theoretical maximum. In this paper,
we  briefly  review  these  two  algorithms,  establish  the 
relationship  between  them,  and  define  how  a  proper 
choice  of  parameters  specifies  their  error  correction 
properties. 

In  such  networks,  memories  to  be  stored  are  typ¬ 
ically  programmed  as  fixed  points  of  the  structure. 
Error  correction  is  obtained  by  attracting  to  one  of 
the  stored  fixed  points,  initial  states  (or  probes)  of 
the system that are close to the fixed points. We show
that  in  the  spectral  scheme  the  radius  of  attraction 
around  each  of  the  stored  stable  states  is  controlled 
by  the  relative  size  of  the  eigenvalues  of  the  inter¬ 
connection  matrix.  The  dual  spectral  algorithm,  on 
the  other  hand,  leads  to  a  method  for  programming 
the  shape  of  the  attraction  basin  around  each  of  the 
elements  of  the  stored  vectors.  We  present  a  new 
method  based  on  linear  programming  for  selecting 
the  parameters  of  the  dual  spectral  algorithm  which 
determine  its  attraction  dynamics  around  each  stored 
fixed  point  and  we  suggest  a  hybrid  algorithm  that 
can  provide  more  arbitrary  control  of  the  shape  of 
the  attraction  basin. 

We  consider  a  fully  interconnected  network  of  n 
McCulloch-Pitts neurons with the instantaneous binary
outputs (-1 or 1) of each of the neurons being
fed back as inputs to the network: if $u_1[t], \ldots, u_n[t]$
are the outputs of each of the n neurons
in the network at epoch t, then the neural update of
the ith neuron results in a new state at epoch t + 1
according to the familiar threshold rule:

$$u_i[t + 1] = \Delta\Bigl(\sum_{j=1}^{n} w_{ij}\, u_j[t]\Bigr),$$

where $w_{ij}$ is the interconnection weight from neuron j to neuron i and $\Delta$ denotes the hard-limiting threshold function, which maps its argument to -1 or 1 according to its sign.

The  mode  of  operation  may  be  synchronous  (with 
all  the  neurons  being  updated  simultaneously  at  each 
epoch)  or  asynchronous  (with  at  most  one  neuron 
being  updated  at  each  epoch).  In  the  application  of 
these  networks  to  associative  memory  both  modes 
of operation lead to very similar associative behaviour
(cf. Psaltis & Venkatesh, 1989, for instance) and
we  will  not  make  a  distinction  in  this  paper  as  to  the 
precise  mode  of  operation. 
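
The two modes of operation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the tie-breaking choice of mapping a zero field to +1, and the tiny single-memory outer-product weight matrix used as a demonstration are all assumptions made for the example, not details taken from the paper.

```python
def threshold(x):
    """Hard limiter; sending x == 0 to +1 is an assumption of this sketch."""
    return 1 if x >= 0 else -1


def synchronous_step(W, u):
    """Update every neuron simultaneously from the current state u."""
    n = len(u)
    return [threshold(sum(W[i][j] * u[j] for j in range(n))) for i in range(n)]


def asynchronous_step(W, u, i):
    """Update only neuron i, leaving all other components unchanged."""
    v = list(u)
    v[i] = threshold(sum(W[i][j] * u[j] for j in range(len(u))))
    return v


def run_to_fixed_point(W, u, max_epochs=100):
    """Iterate synchronous updates until the state stops changing.

    Synchronous operation can cycle, so the loop is capped and simply
    returns the last state if no fixed point was reached.
    """
    for _ in range(max_epochs):
        v = synchronous_step(W, u)
        if v == u:
            return u
        u = v
    return u


# Hypothetical demonstration: one memory stored with zero-diagonal
# outer-product weights, probed with a single flipped bit.
memory = [1, -1, 1, -1]
W_demo = [[memory[i] * memory[j] if i != j else 0 for j in range(4)]
          for i in range(4)]
probe = [1, 1, 1, -1]
recalled = run_to_fixed_point(W_demo, probe)
```

In this toy case the probe differs from the memory in one component, and a single update (in either mode) restores it.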

The nature of flow in state space is completely
determined once the neural interconnection strengths
and the mode of operation are specified. We will be
interested in specifying patterns of interconnectivity
for which arbitrarily prescribed m-sets of memories
$u^{(1)}, \ldots, u^{(m)} \in B^n$ can be stored in the network.

In order for the network to act as an associative
memory, we require that the memories themselves
be stable (i.e., all subsequent operations on the
memory $u^{(\alpha)}$ give back $u^{(\alpha)}$). Stable memories are
hence fixed points of the network. Furthermore, we
require states close to any of the memories to be
mapped into the memory by the network. This is the
associative or error correcting feature requisite in an
associative memory. We call the average Hamming
distance from a memory over which such error correction
is exhibited the attraction radius of the memory.

The  quadratic  Hamiltonian  (energy)  and  the  Man¬ 
hattan  form  have  been  shown  to  be  Lyapunov  func¬ 
tions  for  fully  connected  networks  with  symmetric 
connections (Hopfield, 1982; Goles & Vichniac, 1986;
Peretto & Niez, 1986; Psaltis & Venkatesh, 1989),
hence guaranteeing that state trajectories of such
networks will terminate in stable points. If the neural
interconnection weights are chosen so that the desired
memories are stable, then the existence of a
Lyapunov function for the system indicates that the
memories will exhibit an attraction radius of error
correction. The outer product and the dual spectral
algorithms lead to symmetric weights but this is not
generally true for the spectral scheme. Nevertheless,
the spectral scheme also exhibits very similar attraction
dynamics (Psaltis & Venkatesh, 1989), even
though there is no known Lyapunov function for the
general  case.  In  all  these  algorithms  stability  of  the 


stored  memories  can  be  assured  with  high  probability 
if  the  number  of  memories  is  within  the  storage  ca¬ 
pacity  of  the  algorithm  (McEliece  et  al.,  1987;  Psaltis 
&  Venkatesh,  1989).  The  existence  of  Lyapunov 
functions  then  guarantees  that  the  memories  (being 
fixed  points)  lie  at  the  minima  of  the  Lyapunov  func¬ 
tions. 

2.  ALGORITHMS 

2.1.  The  Spectral  Algorithm 

In the spectral scheme, the interconnection matrix
$W^s$ is defined as follows:

$$W^s = U \Lambda (U^T U)^{-1} U^T, \tag{1}$$

where $\Lambda = \mathrm{dg}[\lambda^{(1)}, \ldots, \lambda^{(m)}]$ is the m × m diagonal
matrix of positive eigenvalues $\lambda^{(1)}, \ldots, \lambda^{(m)} > 0$,
and $U = [u^{(1)}\, u^{(2)} \cdots u^{(m)}]$ is the n × m matrix of
memory column vectors.

We note that

$$W^s U = U \Lambda, \tag{2}$$

where $u^{(1)}, \ldots, u^{(m)}$ are the eigenvectors of $W^s$ and
$\Lambda$ is the spectrum of $W^s$ (Venkatesh & Psaltis, 1985;
Personnaz, Guyon, & Dreyfus, 1985; Venkatesh &
Psaltis, 1989). Therefore, we are guaranteed to have
stable memories as long as $W^s$ is well defined.

For the case of an m-fold degenerate spectrum
$\lambda^{(1)} = \cdots = \lambda^{(m)} = \lambda > 0$, we see that the matrix $W^s$
is symmetric with nonnegative eigenvalues (i.e., it is
nonnegative definite). Therefore there exist Lyapunov
functions in this case, and moreover it has
been shown that the stored memories form global
energy minima (Venkatesh & Psaltis, 1989).

For the general spectral matrix in eqn (1), exact
Lyapunov functions are hard to come by. The signal-to-noise
ratio, however, serves as a good ad hoc measure
of attraction capability. Consider synchronous
operations with $W^s$ on a state vector $u = u^{(\alpha)} + \delta u \in B^n$. We have

$$W^s u = W^s(u^{(\alpha)} + \delta u) = W^s u^{(\alpha)} + W^s \delta u.$$

Once again, there exists a “signal” term, $W^s u^{(\alpha)}$, and
a “noise” term, $W^s \delta u$. We anticipate that the greater
the signal-to-noise ratio, the greater the attraction
around $u^{(\alpha)}$. Let the Hamming distance between u
and $u^{(\alpha)}$, $d_H(u, u^{(\alpha)})$, equal d (i.e., $\|\delta u\| = 2\sqrt{d}$). The
(strong) norm of the matrix $W^s$ is defined as

$$\|W^s\| = \sup_{\|x\| \ne 0} \frac{\|W^s x\|}{\|x\|}.$$

It follows (cf. Strang, 1980) that $\|W^s\| = \sqrt{k}$, where
k is the largest eigenvalue of the matrix $(W^s)^T W^s$.

For the case of the degenerate spectrum $\lambda^{(1)} = \cdots = \lambda^{(m)} = \lambda > 0$,
$W^s$ is symmetric, and $(W^s)^T W^s = (W^s)^2$. Therefore, the maximum eigenvalue of
$(W^s)^T W^s$ is $k = \lambda^2$, and the signal-to-noise ratio
(SNR) is given by

$$\mathrm{SNR} = \frac{\|W^s u^{(\alpha)}\|}{\|W^s \delta u\|} \ge \frac{\lambda\sqrt{n}}{(\sqrt{k})(2\sqrt{d})} = \frac{\sqrt{n}}{2\sqrt{d}}.$$

Thus, we would expect the attraction sphere around
$u^{(1)}, \ldots, u^{(m)}$ to increase as n increases for the m-fold
degenerate spectral scheme. For the general
nondegenerate case, we expect that by varying the
size of $\lambda^{(\alpha)}$, the SNR, and hence the attraction capability,
can be proportionately increased or decreased
for the αth memory $u^{(\alpha)}$ (Figure 1).
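
As a small worked illustration of the degenerate-spectrum estimate, the sketch below evaluates the ratio of the signal norm $\lambda\sqrt{n}$ to the largest possible noise norm $\lambda \cdot 2\sqrt{d}$, which reduces to $\sqrt{n}/(2\sqrt{d})$. The function name is hypothetical, and the numbers are only meant to show how the estimate scales with n and d.

```python
import math


def snr_degenerate(n, d):
    """SNR estimate sqrt(n) / (2 sqrt(d)) for the m-fold degenerate spectrum.

    n is the number of neurons and d the Hamming distance of the probe from
    the memory. The common eigenvalue cancels, so it does not appear here.
    """
    if n <= 0 or d <= 0:
        raise ValueError("n and d must be positive")
    return math.sqrt(n) / (2.0 * math.sqrt(d))


# The estimate grows with n for a fixed number of errors d ...
snr_small = snr_degenerate(64, 4)
snr_large = snr_degenerate(256, 4)
# ... and shrinks as more bits of the probe are in error.
snr_noisy = snr_degenerate(64, 16)
```

Because the eigenvalue cancels in the degenerate case, memory-specific control of attraction only becomes available once the eigenvalues are allowed to differ.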

Using a result of Komlós (1967) we can show that
for all randomly chosen n-tuples $u^{(1)}, \ldots, u^{(m)} \in B^n$,
and $m \le n$, the probability that $W^s$ is well defined
approaches one as $n \to \infty$. It immediately follows
that the static capacity of the spectral scheme is n,
as a linear transformation has at most n eigenvalues.

Let $N^s$ denote the number of elementary operations
required to compute the weight matrix $W^s$ directly
from the m memories to be stored. Then using
the fact that $(U^T U)^{-1}$ is symmetric, we can use the
Cholesky decomposition to compute its inverse. This
along with the rest of the matrix multiplications gives
us that $N^s = mn^2 + m^2 n + m^3/2 + O(n^2)$ (details
can be found in Venkatesh and Psaltis (1989)).
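
The construction of the spectral weight matrix can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it uses exact rational arithmetic and a plain Gauss-Jordan inverse rather than the Cholesky route mentioned above, and the helper names, the two example memories, and the eigenvalues 3 and 1 are all hypothetical.

```python
from fractions import Fraction


def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def matvec(W, u):
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def inverse(A):
    """Gauss-Jordan inverse over exact rationals; raises if A is singular."""
    m = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(m)]
           for i, row in enumerate(A)]
    for col in range(m):
        pivot = next((r for r in range(col, m) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("memories are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]


def spectral_weights(memories, eigenvalues):
    """Return U Lambda (U^T U)^{-1} U^T for memories given as a list of vectors."""
    Ut = memories                 # m x n, one memory per row
    U = transpose(memories)       # n x m, one memory per column
    m = len(memories)
    Lam = [[eigenvalues[a] if a == b else 0 for b in range(m)] for a in range(m)]
    gram_inv = inverse(matmul(Ut, U))
    return matmul(matmul(matmul(U, Lam), gram_inv), Ut)


# Hypothetical example: two non-orthogonal memories in n = 6 dimensions.
u1 = [1, 1, 1, 1, 1, 1]
u2 = [1, -1, 1, -1, 1, 1]
W_s = spectral_weights([u1, u2], [3, 1])
```

With these choices each memory is an eigenvector of `W_s` with its prescribed eigenvalue, and any vector orthogonal to both memories is mapped to zero.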

2.2.  Dual  Spectral  Algorithms 

2.2.1. Orthogonal Spaces and Duality. The following
scheme, formally related to the outer product and
spectral algorithms, was introduced by Maruani et
al. (1987).

Let $U = [u^{(1)}\, u^{(2)} \cdots u^{(m)}]$ be the matrix of memories
as before. Let $x^{(\beta)}$, $\beta = 1, \ldots, n - m$, be a set of
linearly independent vectors in $R^n$ which are individually
orthogonal to each of the memories (i.e.,
$X^T U = 0$, where we define the n × (n - m) matrix
$X = [x^{(1)}\, x^{(2)} \cdots x^{(n-m)}]$). Define a weight matrix W
with weights $w_{ij}$ given by

$$w_{ij} = \begin{cases} -\sum_{\beta=1}^{n-m} x_i^{(\beta)} x_j^{(\beta)} & \text{if } i \ne j \\ 0 & \text{if } i = j, \end{cases}$$

where $x_k^{(\beta)}$ is the kth component of $x^{(\beta)}$. If we define
$\mu_i = \sum_{\beta=1}^{n-m} (x_i^{(\beta)})^2$, $i = 1, \ldots, n$, we see that

$$W = M - XX^T, \tag{3}$$

where $M = \mathrm{dg}[\mu_1, \ldots, \mu_n]$. Thus,

$$WU = MU - XX^T U = MU. \tag{4}$$

Comparing eqns (2) and (4) we see that the spectral
and dual spectral algorithms exhibit an interesting
duality. Since the parameters $\mu_i$ are positive for each
choice of i, it follows that

$$\Delta\bigl((W u^{(\alpha)})_i\bigr) = \Delta(\mu_i u_i^{(\alpha)}) = u_i^{(\alpha)}$$

for each $i = 1, \ldots, n$, $\alpha = 1, \ldots, m$.

So the memories $u^{(1)}, \ldots, u^{(m)}$ are fixed points in
the scheme as well.
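
A minimal sketch of the dual spectral construction is given below, assuming the orthogonal vectors are already available. The function names and the particular four-dimensional memories and orthogonal vectors are hypothetical choices made so that the arithmetic can be followed by hand.

```python
def dual_spectral_weights(X):
    """Build W = M - X X^T from vectors orthogonal to every memory.

    X is a list of n - m vectors of length n. Returns the zero-diagonal
    weight matrix W and the diagonal entries mu of M.
    """
    n = len(X[0])
    mu = [sum(x[i] * x[i] for x in X) for i in range(n)]
    W = [[(mu[i] if i == j else 0) - sum(x[i] * x[j] for x in X)
          for j in range(n)]
         for i in range(n)]
    return W, mu


def matvec(W, u):
    return [sum(w * x for w, x in zip(row, u)) for row in W]


# Hypothetical example with n = 4 and m = 2.
memories = [[1, 1, 1, 1], [1, -1, 1, -1]]
# Two independent vectors, each orthogonal to both memories.
X = [[2, 0, -2, 0], [0, 1, 0, -1]]
W_d, mu = dual_spectral_weights(X)
```

Here `mu` comes out as `[4, 1, 4, 1]`, and applying `W_d` to either memory scales its ith component by `mu[i]`, so thresholding returns the memory unchanged.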

W as defined in eqn (3) is a zero-diagonal symmetric
matrix. Thus, we know that there exists some
form of attraction behaviour. However, since the
orthogonal basis X has been chosen arbitrarily, there
is some lack of control in specifying attraction capability.
Specifically, as we shall argue below, the
μ's essentially control directional attraction and we
have no means of specifying these under the above
approach. Our goal here will be to specify an algorithm
where such control is possible.

FIGURE 1. Schematic representation of the attraction space in the spectral scheme for memories with different eigenvalues.

2.2.2. The Effect of the μ-values. In the spectral
scheme, the eigenvectors of $W^s$ are the memories,
so that the column space of $W^s$ is given by the span
of the memories. Therefore, if the memories are far
enough from each other and the initial state vector
u is close enough to a memory, $W^s$ combined with
the thresholding operation projects u onto the memory.

On the other hand, in the dual spectral scheme,
the weight matrix is obtained by taking the correlation
of vectors that are orthogonal to the memories
and then setting the diagonal elements to be 0.
In creating the zero diagonal, we essentially add perturbations
to the left nullspace of U in the directions
of the memories. The strength of the perturbations
along any component i is proportional to $\mu_i$. Thus,
each of the $\mu_i$'s corresponds to a directional distortion,
and we expect the SNR of the dual spectral
scheme to vary from direction to direction proportionately
with the value of $\mu_i$. We therefore expect
that the larger the $\mu_i$, the more information is lost if the
ith bit is flipped and, hence, the smaller the attraction
would be in the ith direction.

As an illustration, let us consider the case where
n = 3, and $\mu_x < \mu_y, \mu_z$ (Figure 2). Each memory u
would be preferentially attracted in the δx-direction,
indicated schematically by an attraction cone in Figure 2
(i.e., a vector with a different x component will
probably map back to u but vectors with different y
and z components will probably not be within the


attraction  region  of  u).  In  other  words. 


2.2.3. Specifying Directional Attraction With Linear
Programming. The previous section's discussions point
to a necessity of somehow specifying the μ-values if
we require direction-specific attraction. Specifically,
for a prescribed set $\mu_1, \ldots, \mu_n > 0$ of directional
attraction strengths, and $M = \mathrm{dg}[\mu_1, \ldots, \mu_n]$, we
require a weight matrix $W^d$ such that

$$W^d U = MU. \tag{5}$$

We define $W^d$ such that:

$$w^d_{ij} = \begin{cases} -\sum_{\beta=1}^{n-m} (x_{i\beta} b_\beta)(x_{j\beta} b_\beta) & \text{if } i \ne j \\ 0 & \text{if } i = j, \end{cases} \tag{6}$$

where $x_{i\beta}$ is the ith component of the basis vector $x^{(\beta)}$
as defined earlier, and $b_\beta$ is the βth component of a
vector b which we will specify shortly. Thus, given
$\mu_1, \ldots, \mu_n$ we need to find a vector b such that
with Y = Xb (that is, the βth column of Y is $b_\beta x^{(\beta)}$)

$$W^d = M - YY^T. \tag{7}$$

(Note that the columns of Y, in general, are not
orthogonal.)

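FIGURE 2. Schematic representation of the directional attraction space in the dual spectral scheme for a choice of $\mu_x < \mu_y, \mu_z$.

Assuming that $W^d$ has the form given in eqn (6),
let us now consider the effect of $W^d$ on the ith element
of a memory $u^{(\alpha)}$:

$$\begin{aligned}
[W^d u^{(\alpha)}]_i &= \sum_{j=1}^{n} w^d_{ij}\, u_j^{(\alpha)} \\
&= -\sum_{\beta=1}^{n-m} \sum_{j=1}^{n} b_\beta^2\, x_{i\beta} x_{j\beta}\, u_j^{(\alpha)} + \sum_{\beta=1}^{n-m} b_\beta^2\, x_{i\beta}^2\, u_i^{(\alpha)} \\
&= \sum_{\beta=1}^{n-m} b_\beta^2\, x_{i\beta}^2\, u_i^{(\alpha)},
\end{aligned}$$

where the double sum vanishes because each $x^{(\beta)}$ is orthogonal to $u^{(\alpha)}$.
We require from eqn (5) that

$$[W^d u^{(\alpha)}]_i = \mu_i u_i^{(\alpha)},$$

where $\mu_i > 0$. By inspection, we obtain the relationship

$$\mu_i = \sum_{\beta=1}^{n-m} b_\beta^2\, x_{i\beta}^2.$$

Define $a_{i\beta} = x_{i\beta}^2$, and $c_\beta = b_\beta^2$. Then we require

$$Ac = M_\mu,$$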

where A is a known n × (n - m) matrix with nonnegative
elements $a_{i\beta} = x_{i\beta}^2$, c is an unknown (n - m)-dimensional
vector with $c_\beta = b_\beta^2$ constrained to
be nonnegative, and $M_\mu$ is a specified n-dimensional
vector with positive components $\mu_1, \ldots, \mu_n$.

We  notice  that  this  is  an  overspecified  system  of 
n  equations  with  (n  -  m)  unknowns,  where  both  c 
and  M„  are  constrained  to  have  nonnegative  ele¬ 
ments.  Linear  programming  techniques  can  be  used 
to  solve  this  system  of  equations.  We  can  choose  the 
μ-values in a variety of ways. Two representative
methods  are  suggested  here. 

Specifying $\mu_1, \ldots, \mu_k$, $k < n - m$. The canonical
form of the linear programming problem that the
simplex method solves is:

Minimize the goal function $c^T y$ subject to the constraints

$$Ay = b,$$

where the vector y is unknown, and $y \ge 0$.

In this case, we specify k positive values of $\mu_i$
and minimize the maximum of the (n - k) unspecified
values of $\mu_i$ subject to the constraints $\mu_{k+1}, \ldots, \mu_n > 0$,
and $c_1, \ldots, c_{n-m} \ge 0$. In other words,
we have the following equations

$$\begin{aligned}
a_{1,1}c_1 + \cdots + a_{1,n-m}c_{n-m} &= \mu_1 \\
&\;\;\vdots \\
a_{k,1}c_1 + \cdots + a_{k,n-m}c_{n-m} &= \mu_k \\
a_{k+1,1}c_1 + \cdots + a_{k+1,n-m}c_{n-m} &\le \epsilon \\
&\;\;\vdots \\
a_{n,1}c_1 + \cdots + a_{n,n-m}c_{n-m} &\le \epsilon,
\end{aligned}$$

where $c_i \ge 0$, $\epsilon > 0$, and we want to find c which
minimises $\epsilon$.

To convert the n - k inequalities to equalities,
we subtract $\epsilon$ from both sides of the equation and
add slack variables $z_1, \ldots, z_{n-k}$ to give us the following
n - k equations

$$\begin{aligned}
a_{k+1,1}c_1 + \cdots + a_{k+1,n-m}c_{n-m} - \epsilon + z_1 &= 0 \\
&\;\;\vdots \\
a_{n,1}c_1 + \cdots + a_{n,n-m}c_{n-m} - \epsilon + z_{n-k} &= 0,
\end{aligned}$$

in addition to the first k equations. Now we have n
equations with 2n - m - k unknown nonnegative
quantities $(c_1, \ldots, c_{n-m}, z_1, \ldots, z_{n-k})$.

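Let us label $\epsilon$ as $c_0$. By inspection, we see that the
goal function to be minimised is $c_0$, subject to the
constraints $A'c' = M'_\mu$, where $c'$ is a (2n - m -
k + 1)-dimensional vector, $M'_\mu$ is an n-dimensional
vector, and $A'$ is an n by 2n - m - k + 1 matrix;
that is, we require to solve

$$\begin{pmatrix}
0 & a_{1,1} & \cdots & a_{1,n-m} & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & a_{k,1} & \cdots & a_{k,n-m} & 0 & \cdots & 0 \\
-1 & a_{k+1,1} & \cdots & a_{k+1,n-m} & 1 & & 0 \\
\vdots & \vdots & & \vdots & & \ddots & \\
-1 & a_{n,1} & \cdots & a_{n,n-m} & 0 & & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-m} \\ z_1 \\ \vdots \\ z_{n-k} \end{pmatrix}
=
\begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

and $c_i, z_i \ge 0$. This is in the canonical form for the
simplex method.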
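
The assembly of this canonical system can be sketched in code. The snippet below only builds the augmented matrix, right-hand side, and cost vector, and checks a candidate solution; it does not implement the simplex method itself, which would be handed to an LP solver. All names are hypothetical, and the small matrix A is an assumed example with n = 4, n - m = 2, and k = 2.

```python
def build_canonical_system(A, mu_specified):
    """Assemble A', M'_mu and the cost vector for the min-max problem.

    A is the n x (n - m) matrix of squared components a[i][beta];
    mu_specified holds the k prescribed values mu_1..mu_k.
    Unknowns are ordered as (c_0, c_1..c_{n-m}, z_1..z_{n-k}).
    """
    n = len(A)
    k = len(mu_specified)
    rows = []
    for i in range(n):
        if i < k:
            # Equality row: c_0 and the slack variables do not appear.
            lead, slack = 0, [0] * (n - k)
        else:
            # Row i > k: a_i . c - c_0 + z_{i-k} = 0.
            lead = -1
            slack = [1 if s == i - k else 0 for s in range(n - k)]
        rows.append([lead] + list(A[i]) + slack)
    rhs = list(mu_specified) + [0] * (n - k)
    cost = [1] + [0] * (len(A[0]) + n - k)   # minimise c_0 only
    return rows, rhs, cost


def satisfies(rows, rhs, solution):
    """Check that a candidate nonnegative vector meets every equality."""
    if any(v < 0 for v in solution):
        return False
    return all(sum(a * x for a, x in zip(row, solution)) == b
               for row, b in zip(rows, rhs))


# Assumed example: a[i][beta] are squares of orthogonal-vector components.
A = [[4, 0], [0, 1], [4, 0], [0, 1]]
rows, rhs, cost = build_canonical_system(A, [4, 1])
candidate = [4, 1, 1, 0, 3]   # (c_0, c_1, c_2, z_1, z_2)
```

For this example the system has 2n - m - k + 1 = 5 unknowns, and the candidate vector satisfies all four equalities with $c_0 = 4$.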

Specifying $\mu_1, \ldots, \mu_n$. In this case, we specify all
the values of $M_\mu$. We indicate two possible options
when solving for c.

1. Minimise the mean-square error given by

$$\|Ac - M_\mu\|^2 = \sum_{i=1}^{n} (a_{i,1}c_1 + \cdots + a_{i,n-m}c_{n-m} - \mu_i)^2$$

subject to the constraints $M_\mu > 0$, $c \ge 0$.

This  is  a  quadratic  programming  problem. 
However,  this  problem  can  be  reformulated  as  a 
simplex  method  problem  and  can  be  solved  using 
a  variation  of  the  traditional  simplex  method  called 
Wolfe’s  method  (Wolfe,  1959). 

2. Minimise the largest absolute error $c_0$, given by

$$\max(|\epsilon_1|, \ldots, |\epsilon_n|),$$

where $\epsilon_i$, the error in $\mu_i$, is

$$\mu_i - (a_{i,1}c_1 + \cdots + a_{i,n-m}c_{n-m}), \qquad i = 1, \ldots, n.$$

Our problem now is to minimize $c_0$ subject to $c_0, c_1, \ldots, c_{n-m} \ge 0$.
To solve this problem,¹ we note that we have n pairs of inequality constraints
of the form

$$\begin{aligned}
-c_0 + a_{i,1}c_1 + \cdots + a_{i,n-m}c_{n-m} &\le \mu_i, \\
-c_0 - a_{i,1}c_1 - \cdots - a_{i,n-m}c_{n-m} &\le -\mu_i.
\end{aligned}$$

The addition of slack variables puts the problem
in canonical form.

¹ This is known as Chebyshev's Approximation (Franklin, 1980, p. 8).

2.2.4. Characterisation of the Dual Spectral Scheme. For simplicity, we consider algorithms employing the first linear programming approach outlined above. We have modified the initial basis for the nullspace of U using the results of the simplex method such that

$$W^{d} = M - YV,$$

where $M = \mathrm{dg}[\mu_1, \mu_2, \ldots, \mu_n]$ with $\mu_1, \ldots, \mu_k > 0$ specified by us, and $0 < \mu_{k+1}, \ldots, \mu_n \leq \epsilon < \min(\mu_1, \ldots, \mu_k)$, and Y = Xb is a set of basis vectors for the left nullspace of U. Since $\mu_i$, i = 1, ..., n, are positive, we see that all the memories are strictly stable in the dual spectral scheme as long as the memories $u^{(1)}, \ldots, u^{(m)}$ are linearly independent, and we are able to find the vector c in the system (8) through linear programming.

As asserted earlier, since $W^{d}$ is a symmetric, zero-diagonal matrix, there exist Lyapunov functions for this scheme in both modes of operation. We have also conjectured that the attraction is directional in nature. The storage capacity of the dual spectral scheme of eqn (6) is directly n - 1. Specifically, n - 1 is the number of memories for which we can still specify a left nullspace X. (By Komlós' result (Komlós, 1967), we are guaranteed that almost all choices of n memories or fewer are linearly independent, so that for almost all choices of n - 1 memories there is an orthogonal subspace of dimension 1, while almost all choices of n memories span the space $\mathbb{R}^n$ and therefore the orthogonal subspace is of dimension 0.)

To find an n-dimensional vector under constraints, the simplex method iterates from one feasible solution to another until it finds an optimal feasible solution. The maximum number of iterations that the simplex method can go through to find an n-dimensional vector is $2^n - 1$. (This happens when the simplex method tests each vertex of the polyhedron that bounds the feasible region.) However, it has been widely reported (Chvátal, 1983; Murty, 1983) that, in practice, the number of iterations is almost always between 1 and 3 times the number of constraints. Thus, for the case of specifying k values of $\mu_i$, we would expect at most 3n iterations. The computational complexity of each iteration is dependent on how the simplex method is implemented. For the revised simplex method, a good estimate of the average cost of each iteration in our scheme is 52n - 10m - 10k + 10, while for the standard simplex method, a good estimate is $(2n^2 - mn - kn + n)/4$ (cf. Chvátal, 1983, p. 113). Thus, we estimate that the total cost of specifying k values of $\mu_i$ is $O(n^2)$ (using the revised simplex method). The cost of finding a basis
for  the  nullspace  of  U  (through  Gram-Schmidt  or- 
thogonalisation)  includes  finding  (U^U)"'  and  two 
other  matrix  multiplications  and  is  given  by  mn- 
(m-n)l2  -  mV2  -i-  O(n-).  Finally,  the  cost  of  finding 
W''  from  c  and  X  is  /i’  -  n’m  -i-  0{n-).  So,  we  can 
say  that  on  the  average, 

N'‘  =  n'  +  im-n  -  mn-  -  nt'/2  +  O(n-), 

where  N'‘  is  the  number  of  elementary  operations 
needed  to  compute  W"'. 
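The per-iteration estimates quoted above are easy to evaluate for concrete sizes. The following is a minimal sketch; the function names and the sample values of n, m, and k are chosen here for illustration only and are not taken from the simulations.

```python
def revised_simplex_iteration_cost(n, m, k):
    """Estimated average cost of one revised-simplex iteration: 52n - 10m - 10k + 10."""
    return 52 * n - 10 * m - 10 * k + 10


def standard_simplex_iteration_cost(n, m, k):
    """Estimated average cost of one standard-simplex iteration: (2n^2 - mn - kn + n)/4."""
    return (2 * n * n - m * n - k * n + n) / 4


def typical_iteration_bound(n):
    """Upper end of the 'one to three times the number of constraints' rule of thumb."""
    return 3 * n


def worst_case_iterations(n):
    """Worst-case iteration count quoted in the text: 2^n - 1."""
    return 2 ** n - 1


# Illustrative sizes only (assumed, not from the source).
n, m, k = 32, 6, 8
print(revised_simplex_iteration_cost(n, m, k))
print(standard_simplex_iteration_cost(n, m, k))
print(typical_iteration_bound(n), worst_case_iterations(n))
```

Multiplying the revised-simplex cost, which is linear in n, by the roughly 3n iterations expected in practice gives the quadratic growth stated in the text.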

There are a number of open questions involved with the dual spectral scheme arising from the nature of the construction of the W matrix. The number of directions k that can be specified given a set of m memories and n neurons is of interest. It is obvious from the previous discussion about the dimensions of A and c that we can surely specify no more than n - m directions. However, there is a possibility (albeit small) that there exist no feasible solutions for pathological cases where k < n - m. This is seen particularly when the number n - m is very small. Another quantity we are interested in is the size of ε, the largest of the unspecified μ's, compared to the size of the specified μ's, since we have conjectured that this will affect directional attraction.

While there exists little theory for the simplex method which will enable us to gauge these parameters, simulations show that ε is typically small compared to $\mu_i$ for the specified directions ($< 0.5\mu_i$), and k is typically of the order of n/4 in the ranges simulated. We conjecture that this behaviour continues to hold for large n.

2.3.  Composite  Algorithms 

In section 2.1 we saw ways of increasing the radii of attraction-spheres around memories. In section 2.2 we saw ways of specifying increased attraction in certain directions around each of the memories. A natural extension of these schemes is to create a composite scheme with weight matrix $W^{c}$ given by

$$W^{c} = W^{s} + W^{d}.$$

Since $W^{c}$ is a linear combination of $W^{s}$ and $W^{d}$, we would expect memories to be stable in the composite scheme for reasons described in the previous sections. The idea of the composite scheme is to specify both memory-specific attraction by specifying λ for each memory, and direction-specific attraction by specifying μ for the individual directions (Figure 3).
FIGURE 3. Schematic representation of the joint memory-specific and direction-specific attraction space for two memories in the composite scheme.


Here, the spectrum of the weight matrix is no longer degenerate, and consequently the matrix is no longer symmetric. As the composite algorithm combines the memory-specific spectral algorithm and the direction-specific dual spectral algorithm, it works effectively in shaping the attraction regions as desired. It should be noted that the relative values of the $\lambda_1, \ldots, \lambda_m$ compared to the $\mu_1, \ldots, \mu_n$ need to be considered in order not to lose the effects of one of the two parts of the composite scheme.

Note  that  the  capacity  of  the  composite  scheme  is 
n  -  1.  The  algorithm  complexity  of  the  composite 
scheme  is  the  sum  of  the  complexities  of  the  spectral 
and  dual  spectral  schemes,  except  that  we  need  not 
find  twice.  Therefore  the  complexity  is 

given by $3n^3 + O(n^2)$ for $m \approx n$.


3.  SIMULATIONS 

Computer  simulations  were  carried  out  to  verify  the 
behaviour  of  the  various  schemes.  Systems  with  state 
vectors  of  32  bits  were  considered  in  the  simulations. 
The  memories  were  chosen  randomly  with  a  binomial 
pseudo-random  number  generator  with  equiproba- 
ble  values  1  and  -  1.  For  each  size  of  memory  set  m 
that  was  investigated,  simulations  were  carried  out 
for  each  of  the  schemes,  and  the  behaviour  of  the 
schemes  was  averaged  out  over  between  20  and  100 
trials,  where  over  each  trial  a  different  random  set 
of  memories  was  generated.  Error  correction  data 
were  compiled  at  each  trial  by  testing  the  conver¬ 
gence  of  randomly  generated  probes  at  increasing 
Hamming  distance  from  a  memory.  Attraction  radii 


were  estimated  by  averaging  the  maximum  error  cor¬ 
rection  radius  for  each  trial  over  the  number  of  trials. 
The  graphs  included  here  were  obtained  from  syn¬ 
chronous  mode  operations.  However,  we  found  that 
the  schemes  essentially  behaved  the  same  under  an 
asynchronous  mode  of  operation.  The  graphs  show 
typical  stability  and  attraction  behaviour  in  each  of 
the  schemes.  We  can  extract  information  on  expected 
worst  and  best  case  behaviour  for  a  set  of  random 
memories  from  these  curves. 
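The probing procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the reported simulations: it stores memories with the outer-product rule, takes the threshold of zero to be +1, runs synchronous updates for a bounded number of steps, and reports the largest Hamming distance at which every sampled probe converged to the memory. The number of probes per distance and the step limit are arbitrary choices.

```python
import random


def threshold(x):
    # Tie convention (zero maps to +1) is an assumption.
    return 1 if x >= 0 else -1


def outer_product_weights(memories):
    """Sum of outer products of the memory vectors (full diagonal kept)."""
    n = len(memories[0])
    return [[sum(u[i] * u[j] for u in memories) for j in range(n)]
            for i in range(n)]


def run_synchronous(W, state, max_steps=50):
    """Iterate synchronous updates until a fixed point or the step limit."""
    n = len(state)
    for _ in range(max_steps):
        nxt = [threshold(sum(W[i][j] * state[j] for j in range(n)))
               for i in range(n)]
        if nxt == state:
            break
        state = nxt
    return state


def attraction_radius(W, memory, probes_per_distance, rng):
    """Largest d such that all sampled probes at distances 1..d are corrected."""
    n = len(memory)
    radius = 0
    for d in range(1, n + 1):
        for _ in range(probes_per_distance):
            probe = list(memory)
            for i in rng.sample(range(n), d):  # flip d distinct bits
                probe[i] = -probe[i]
            if run_synchronous(W, probe) != memory:
                return radius
        radius = d
    return radius


# Single-memory example with n = 16 (values chosen for illustration).
memory = [1, -1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1, -1, -1, 1, -1]
W = outer_product_weights([memory])
print(attraction_radius(W, memory, probes_per_distance=5, rng=random.Random(0)))
```

Averaging this quantity over independently drawn memory sets would give an estimate in the spirit of the attraction radii reported in the figures.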

The behaviour of the outer product scheme is highlighted in Figures 4 and 5. As anticipated, the number of stable memories declines precipitously as m increases beyond a certain point (the static capacity) as seen in Figure 4. While n is quite small in these examples, the figures nonetheless are a precursor of the 0-1 behaviour which develops around the static capacity of n/(4 log n) for large n (Venkatesh, 1986; McEliece et al., 1987; Komlós & Paturi, 1988). Figure 5 shows the graceful degradation of the average Hamming radius of attraction around the memories as the number of stored memories increases. (We averaged the maximum attraction radius for each of the memories over several independent trials to obtain estimates of the average radius of attraction.) The analysis in McEliece et al. (1987) indicates that the attraction is neither memory- nor direction-specific, and that we obtain uniform Hamming balls of attraction around each memory with high probability for large n.

FIGURE 4. The percentage of stable memories plotted against the number of memories m in the outer product scheme when n = 32.

FIGURE 5. The average radius of attraction around a stable memory is plotted versus the number of memories for n = 32 in the outer-product scheme. The attraction radius is estimated by averaging the maximum Hamming distance of error-correction around a stable memory over several independent runs.

FIGURE 6. The attraction radius around a typical memory plotted as a function of the number of memories m, in the degenerate spectral scheme where all the eigenvalues are chosen equal to λ = n = 32. Estimates of the attraction radius for a given number of memories were again obtained by averaging the maximum distance of error-correction around a memory over several independent runs.
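For a sense of scale, the asymptotic static capacity n/(4 log n) can be evaluated directly. The sketch below assumes the logarithm is natural; the source does not state the base at this point, and the function name is illustrative.

```python
import math


def static_capacity_estimate(n):
    """Asymptotic estimate n / (4 ln n); natural logarithm is an assumption."""
    return n / (4.0 * math.log(n))


for n in (32, 1024, 2 ** 20):
    print(n, round(static_capacity_estimate(n), 2))
```

For n = 32 the estimate is only a little above two memories, which is one reason the text treats the small-n curves as a precursor of the large-n behaviour rather than a direct confirmation.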

Simulations  highlighting  the  behaviour  of  the 
spectral  scheme  as  a  viable  algorithm  for  associative 
memory  are  presented  in  Figures  6  and  7.  The  av¬ 
erage  Hamming  radius  of  attraction  again  degrades 
gracefully  as  the  number  of  memories  increases,  as 
illustrated in Figure 6, where the degenerate spectral
algorithm  exhibits  uniform  balls  of  attraction  around 
the  memories.  (The  static  capacity  here  is  clearly  n 
as  outlined  before  and  verified  in  our  simulation.) 
As  can  be  seen,  the  dynamical  behaviour  of  the  spec¬ 
tral  scheme  is  qualitatively  similar  to  the  outer  prod¬ 
uct  scheme,  but  somewhat  better  over  all  ranges. 

Investigations into attraction dynamics in the spectral scheme when there is a large deviation in eigenvalue size confirm theoretical predictions that the sizes of the attraction basins are memory-specific and increase with increase in the eigenvalue size of the corresponding memory. These trends are exemplified in the typical plot of Figure 7, where half the eigenvalues are fixed arbitrarily at n, and the other half of the eigenvalues are fixed at a fraction of n. The plots show the relative sizes of the Hamming balls of attraction for memories with large eigenvalue as compared to memories with small eigenvalue, as a function of the ratio of the two eigenvalues. The results are similar for other values of m in the range of interest (i.e., values of m for which there is significant attraction: the attraction radii around the memories are proportional to the corresponding eigenvalue size).

The feasibility of forming the dual spectral matrix $W^{d}$ using the simplex method when $\mu_1, \ldots, \mu_k$ are specified is confirmed in Figures 8 and 9. The success rate (the percentage of trials when the simplex method returns a feasible solution with $\epsilon < \min(\mu_1, \ldots, \mu_k)$) is plotted in Figure 8 against the number of memories m, averaged over various choices of k. In Figure 9, the success rate is plotted as a function of the number of specified directions k, with m as a parameter. Note that the success rate is almost 100% when k is small, and drops gradually with failures occurring most often when k approaches n - m (Figure 9). Figure 10 exhibits plots of average ε versus k for various m. As can be seen, ε increases with increasing k and increasing m. Exhaustive simulations indicate that the values of ε obtained by the simplex algorithm for n = 16 (fixed m, k) are approximately twice those for n = 32. Since the dynamic attraction behaviour of the dual spectral scheme is dependent on the size of ε, these curves are crude indicators of the limits on m and k in the dual spectral scheme.

FIGURE 7. Demonstration of memory-specific attraction in the spectral scheme for n = 32 and m = 6. The memories were divided into two equal-sized groups, one group with eigenvalue λ(large) = n, and the other group with eigenvalue λ(small) varying as a fraction of n. The respective attraction radii of the λ(large) memories and the λ(small) memories are plotted as the ratio λ(small)/λ(large) is increased from zero to one.

Investigations into the attraction dynamics of the dual spectral scheme verify the analytical predictions of its performance as an associative memory. We will use as the measure of attraction in a particular direction x for a particular memory the average Hamming radius from which state vectors converge to that memory when bit x is kept flipped. (Specifically, if $\mu_x$ is large, then inputs with bit x opposite in sign to a memory will be unlikely to converge to the memory, and conversely if $\mu_x$ is small. Equivalently, if bit x of the input vector is constrained to be correctly matched to the corresponding bit of the memory, then the algorithm will tend to correct for rather large distortions in the other components if $\mu_x$ is large, and conversely if $\mu_x$ is small.) Figure 11 exhibits plots of average attraction in both the specified (important) and the unspecified (unimportant) directions, where the component of the input in the direction being investigated was initially kept flipped. Here, the attraction characteristics have been averaged over all the memories for the two cases: (1) the specified directions (corresponding to large values of μ), and (2) the unspecified directions (corresponding to small values of μ). As can be seen, there exists a consistent difference in attraction in the large μ and small μ directions when k is small, with a merging of the attraction capabilities for larger k.

FIGURE 8. The percentage of trials when the simplex method returns a feasible solution (the success rate) in forming the dual spectral matrix $W^{d}$, averaged over various choices of k, the number of specified directional values $\mu_1, \ldots, \mu_k$, plotted as a function of the number of memories m, when n = 32.

FIGURE 9. The percentage of trials when the simplex method returns a feasible solution for the dual spectral scheme (the success rate) plotted as a function of the number of specified directions k, for a choice of n = 32, m = 13. (Here k denotes the number of directional values $\mu_1, \ldots, \mu_k$ specified in the algorithm.)

FIGURE 10. The ratio of the largest value ε of the unspecified directional parameters $\mu_{k+1}, \ldots, \mu_n$ to the smallest of the specified directional parameters $\mu_{\min} = \min\{\mu_1, \ldots, \mu_k\}$, plotted versus k, the number of specified directional parameters, with m as a parameter for n = 32.

The simulations indicate that we do have the capability of separately achieving memory-specific and direction-specific attraction. Investigations into the composite scheme indicate that attraction basins can indeed be shaped over a wide range. Varying the values of the specified μ's, and the large eigenvalues ($\lambda_l$) and small eigenvalues ($\lambda_s$), leads to attraction basins that range from being purely directional to completely "spherical" around memories. A sample case where $\lambda_s = 1$, $\lambda_l = 3$, and $\mu = 6$ (which gives us ε ≤ 3 for moderate values of k and m) is shown in Figures 12 and 13. Figure 12 exhibits plots of memory-specific attraction against the number of memories. As can be seen, there is a superiority of between 6 and 8 Hamming bits of attraction for memories with large eigenvalues as opposed to memories with small eigenvalues. (For n = 16 we obtain a superiority of between 2 and 3 Hamming bits of attraction for the same choice of maximum and minimum eigenvalues.) Direction-specific attraction is mapped in Figure 13. As seen, we obtain a direction-specificity of about 4 bits in attraction capability when comparing the strong and weak directions. (For n = 16 we perceive a 2 to 3 bit difference in attraction capability between specified and unspecified directions for small values of k. When the number of memories m is very small, however, only marginal direction-specificity is displayed.) We stress once again that by increasing the value of the specified μ's, we increase direction-specific attraction at the expense of memory-specific attraction.

FIGURE 11. Demonstration of direction-specific attraction in the dual spectral scheme for n = 32 and m = 5. Curves of attraction radii versus the number of specified directional parameters k are shown for two different directions: a specified (large μ) direction and an unspecified (small μ) direction. Attraction data for a given direction were generated by investigating probe vectors at various Hamming distances from a memory with the component of the probe in the direction being investigated being chosen to be opposite in sign to the corresponding component of the memory. (Flipping a bit in an important (large μ) direction would reduce the attraction to the memory compared to an unimportant (small μ) direction.)

FIGURE 12. Demonstration of memory-specific attraction in the composite scheme for n = 32. The memories are divided into two groups, one group corresponding to a "large" eigenvalue $\lambda_l = 3$, and the other group corresponding to a "small" eigenvalue $\lambda_s = 1$. Attraction radii for a memory are plotted as a function of the number of memories m for the two cases of the memory corresponding to eigenvalues $\lambda_l = 3$ and $\lambda_s = 1$.

FIGURE 13. Demonstration of direction-specific attraction in the composite scheme for n = 32 and m = 6. Directional attraction parameters are specified to be all equal to $\mu_i = 6$, while the largest of the unspecified directional parameters is kept below ε = 3. Attraction radii are plotted in the large μ (specified) and small μ (unspecified) directions as a function of k, the number of specified directions.

FIGURE 14. An overview of the features of the family of spectral algorithms: the outer product (pseudo-spectral) algorithm, the spectral algorithm, the dual spectral algorithm, and the composite algorithm.

Figure  14  summarises  the  main  features  of  the 
three  algorithms,  and  highlights  their  relationship 
with  the  spectral  algorithm;  in  particular,  the  pseudo- 
spectral  nature  of  the  outer-product  algorithm,  and 
the  dual  spectral  nature  of  the  dual  spectral  algorithm 
is  emphasised. 
1 INTRODUCTION

In  this  chapter  we  describe  models  of  autoassociative  memory  based  upon  densely  intercon¬ 
nected  recurrent  networks  of  McCulloch-Pitts  neurons  (McCulloch  and  Pitts,  1943).  The 
model  neurons  in  these  networks  are  linear  threshold  elements  which  compute  the  sign  of 
a  linear  form  of  their  inputs.  In  a  recurrent  network,  a  collection  of  these  neurons  com¬ 
municate  with  each  other  through  linear  synaptic  weights  and  each  neuron  changes  state 
based  on  the  net  synaptic  potential  from  all  the  neurons  in  the  network.  The  instantaneous 
state  of  the  neural  network  is  described  by  the  collective  states  of  the  individual  neurons, 
and  the  choice  of  synaptic  weights  and  the  neuron  updating  rule  determines  the  nature  of 
flow  in  the  state  space  of  the  network.  As  in  any  dynamical  system,  the  fixed  points  of  the 
network  play  a  critical  role  in  determining  its  computational  properties.  In  particular,  such 
networks can be used for encoding a set of prescribed items identified with states u as fixed points of the network, i.e., states u which are fixed under the dynamics of the network. In addition, some form of error correction is desired, so that states in the vicinity of u are mapped into u. When the item is to be retrieved from the network, the search can then be done on either the whole item or on part of it. Such searches by data association are especially useful in pattern recognition applications such as speech and vision. We will describe methods of specifying in a probabilistic sense the conditions under which such error correction occurs.

*This work was supported in part by the Air Force Office of Scientific Research under grant AFOSR 89-0523.

We now consider a fully interconnected network of n McCulloch-Pitts neurons. Each neuron is capable of assuming two values: 1 (firing) and -1 (not firing). The instantaneous binary output of each neuron is fed back as an input to the network. At epoch t, if $u_1[t], u_2[t], \ldots, u_n[t]$ are the outputs of each of the n neurons in the network, then at epoch t + 1, the ith neuron updates itself according to the following threshold rule:

$$u_i[t+1] = \Delta\Bigl(\sum_{j=1}^{n} w_{ij}u_j[t] - w_{i,0}\Bigr),$$

where

$$\Delta(x) = \begin{cases} 1 & \text{if } x \geq 0, \\ -1 & \text{if } x < 0. \end{cases} \qquad (1)$$

Based on the preceding firing rule, each neuron is characterised by a set of n real synaptic weights, and a real threshold value, and the network as a whole is characterised by a matrix of weights $w_{ij}$ and n thresholds $w_{i,0}$. Without loss of generality, we can confine our analysis to zero thresholds as thresholds are easily subsumed by the simple expedient of adding a constant input of -1 to each neuron.
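The remark about subsuming thresholds can be checked directly. In the sketch below (names and numerical values are illustrative assumptions), a neuron with an explicit threshold is compared with a zero-threshold neuron that has one extra weight equal to the threshold and one extra constant input of -1.

```python
from itertools import product


def threshold(x):
    # Tie convention (zero maps to +1) is an assumption.
    return 1 if x >= 0 else -1


def neuron_with_threshold(weights, theta, u):
    """Output of a neuron with explicit threshold theta."""
    return threshold(sum(w * x for w, x in zip(weights, u)) - theta)


def neuron_augmented(weights, theta, u):
    """Same neuron with zero threshold: extra weight theta on a constant input -1."""
    aug_weights = list(weights) + [theta]
    aug_inputs = list(u) + [-1]
    return threshold(sum(w * x for w, x in zip(aug_weights, aug_inputs)))


weights, theta = [0.5, -1.0, 2.0], 0.75
agree = all(
    neuron_with_threshold(weights, theta, u) == neuron_augmented(weights, theta, u)
    for u in product((-1, 1), repeat=3)
)
print(agree)
```

The two forms agree on every input because the augmented weighted sum is the original sum minus the threshold.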

The instantaneous state of the network is an n-tuple $u = (u_1, \ldots, u_n) \in \mathbb{B}^n$, where $\mathbb{B} = \{-1, 1\}$, and $u_i$, the ith component of u, is the output value of the ith neuron. Our goal is to encode items (states in $\mathbb{B}^n$) that are to be stored as fixed points of the recurrent network by an appropriate choice of weight matrix. We label these prescribed items $u^{(1)}, \ldots, u^{(m)}$, and hereafter refer to them as memories to distinguish them from other states of the network.

1.1  Lyapunov  Functions  and  Error  Correction 

The network can operate under different modes of updating. If all neurons are simultaneously updated, the mode of operation is said to be synchronous. If at most one neuron is updated at each epoch, the network is said to operate in an asynchronous mode. It has been shown that both modes of operation lead to very similar associative behaviour in neural networks. Note that in the synchronous mode, we have

$$u[t+1] = \Delta(Wu[t]),$$

where $\Delta : \mathbb{R}^n \to \mathbb{B}^n$ is an n-ary pointwise threshold operator whose ith component $\Delta_i(x)$ is as in (1).
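The two modes of updating can be sketched in a few lines. This is a minimal illustration with zero thresholds; the function names and the small example matrix are assumptions made here, not part of the text.

```python
def threshold(x):
    # Tie convention (zero maps to +1) is an assumption.
    return 1 if x >= 0 else -1


def synchronous_step(W, u):
    """All neurons update at once from the same previous state."""
    n = len(u)
    return [threshold(sum(W[i][j] * u[j] for j in range(n))) for i in range(n)]


def asynchronous_step(W, u, i):
    """Only neuron i updates; every other component is carried over."""
    v = list(u)
    v[i] = threshold(sum(W[i][j] * u[j] for j in range(len(u))))
    return v


# Illustrative symmetric, zero-diagonal weights.
W = [[0, 2, -1],
     [2, 0, 1],
     [-1, 1, 0]]
u = [1, -1, 1]
print(synchronous_step(W, u))      # all three neurons change together
print(asynchronous_step(W, u, 1))  # only the second neuron is updated
```

In the asynchronous case the order in which neurons are visited is a free choice; a random or cyclic schedule are both common.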

Given the arbitrarily prescribed set of m memories, we are interested in specifying patterns of interconnectivity. The nature of flow in state space of the network is completely determined once the matrix of neural interconnection weights is computed. In order for our network to act as an associative memory, we require that the memories be stable. As described earlier, a memory $u^{(\alpha)}$ is stable if all subsequent mappings return $u^{(\alpha)}$, i.e., $u^{(\alpha)}[t+1] = \Delta(Wu^{(\alpha)}[t])$ for all t. Furthermore, we require that the memories exercise a region of influence around themselves, i.e., states close to or similar to memories should map to the corresponding memories in the network, and thereby exhibit error correcting properties. The Euclidean distance between a state u and a memory $u^{(\alpha)}$ is given by

$$d_E(u, u^{(\alpha)}) = \Bigl[\sum_{i=1}^{n} (u_i^{(\alpha)} - u_i)^2\Bigr]^{1/2}.$$

In $\mathbb{B}^n$, $(u_i^{(\alpha)} - u_i) = \pm 2$ if the components are mismatched, and 0 otherwise. Therefore $d_E = 2\sqrt{d_H}$, where the Hamming distance $d_H$ is defined to be the number of mismatched components between the two states. We shall use the average Hamming distance from a memory over which error correction is exhibited as a natural measure of attraction, and call it the attraction radius, ρ, of the memory.
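The relation between the two distances is easy to confirm numerically. The sketch below uses two arbitrary states in {-1, 1}^8; the vectors and function names are illustrative.

```python
import math


def hamming_distance(u, v):
    """Number of mismatched components."""
    return sum(1 for a, b in zip(u, v) if a != b)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [1, -1, 1, 1, -1, -1, 1, -1]
v = [1, 1, 1, -1, -1, 1, 1, 1]

d_h = hamming_distance(u, v)
d_e = euclidean_distance(u, v)
print(d_h, d_e, 2 * math.sqrt(d_h))
```

Each mismatched component contributes 4 to the squared Euclidean distance, which is where the factor of 2 outside the square root comes from.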

From the theory of dynamical systems, we can expect attraction behaviour in neural networks if we can find functions on the systems that are bounded and monotone non-increasing along trajectories in the state space. Such functions are called Lyapunov functions. If such a function exists, the stable points of the system reside at minima of the function. If the memories are programmed to be at these minima, we can achieve the desired attraction behaviour around the memories.

Two such Lyapunov functions are the Hamiltonian Energy function E(u) and the Manhattan Norm function F(u), where

$$E(u) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}u_iu_j$$

and

$$F(u) = -\sum_{i=1}^{n}\Bigl|\sum_{j=1}^{n} w_{ij}u_j\Bigr|.$$

These functions act as Lyapunov functions for certain classes of weight matrices under particular modes of operation (cf. Hopfield, 1982; Goles and Vichniac, 1986; Venkatesh and Psaltis, 1989).¹

Proposition 1 E(u) is non-increasing in asynchronous mode if W is symmetric and has non-negative diagonal elements; E(u) is non-increasing in any mode if W is symmetric and non-negative definite.

Proposition 2 F(u) is non-increasing in synchronous mode if W is symmetric.
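The asynchronous half of Proposition 1 can be checked empirically. The sketch below assumes the energy takes the usual quadratic form E(u) = -(1/2) Σ_i Σ_j w_ij u_i u_j and uses a randomly drawn symmetric integer matrix with zero diagonal; the matrix size, weight range, and update schedule are arbitrary illustrative choices.

```python
import random


def threshold(x):
    # Tie convention (zero maps to +1) is an assumption.
    return 1 if x >= 0 else -1


def energy(W, u):
    """Quadratic energy -(1/2) * sum_ij w_ij u_i u_j (assumed form)."""
    n = len(u)
    return -0.5 * sum(W[i][j] * u[i] * u[j] for i in range(n) for j in range(n))


def random_symmetric_zero_diagonal(n, rng):
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = rng.randint(-3, 3)
    return W


def energy_trace(W, u, sweeps, rng):
    """Energy after each single-neuron (asynchronous) update."""
    u = list(u)
    n = len(u)
    trace = [energy(W, u)]
    for _ in range(sweeps * n):
        i = rng.randrange(n)
        u[i] = threshold(sum(W[i][j] * u[j] for j in range(n)))
        trace.append(energy(W, u))
    return trace


rng = random.Random(0)
W = random_symmetric_zero_diagonal(8, rng)
u0 = [rng.choice((-1, 1)) for _ in range(8)]
trace = energy_trace(W, u0, sweeps=5, rng=rng)
print(all(later <= earlier for earlier, later in zip(trace, trace[1:])))
```

A single flip of neuron i changes the energy by minus twice the magnitude of its net input, so the trace can only stay level or decrease.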

While the existence of Lyapunov functions indicates attraction behaviour, the lack of one does not necessarily indicate that the network will not function as desired. In the following sections, we shall discuss a family of near-optimal algorithms to compute the weight matrix W. All but one of these algorithms result in a symmetric weight matrix, and all of them exhibit both desired properties of stability and attraction. In the following sections, we will establish the relationship between the various algorithms, and evaluate their performance.

1.2  Capacity  and  Complexity 

Two  measures  characteristic  of  any  algorithm  are  the  algorithmic  capacity  and  algorithmic 
complexity. We will look at these measures for each of the algorithms in the succeeding
sections. 

Capacity is the maximal number of memories that can be stored with high probability. It is useful to define capacity as a rate of growth rather than an exact number. Specifically, a sequence of numbers $\{C(n), n \geq 1\}$ is a sequence of capacities if and only if for every $\lambda \in (0, 1)$, as $n \to \infty$ the probability that each of the memories is stable approaches 1 whenever $m \leq (1 - \lambda)C(n)$, and approaches 0 whenever $m \geq (1 + \lambda)C(n)$. We note that a consequence of this is that a sequence of capacities, if it exists, is not unique, but rather determines an equivalence class of sequences $\mathcal{C}$, where C(n) and C'(n) are sequences in $\mathcal{C}$ iff $C(n) \sim C'(n)$ as $n \to \infty$.²

¹ Proofs of propositions in the main text of the chapter are deferred to Appendix A.

We  define  complexity  to  be  the  number  of  elementary  operations  required  to  compute 
the  matrix  of  weights.  For  the  purposes  of  this  discussion,  the  elementary  operations  are 
multiplication  and  addition  of  two  real  values.  We  are  therefore  interested  in  determining 
the complexity of an algorithm given m memories in n-space. In some cases, an algorithm
may  have  a  recursive  definition  which  may  result  in  a  reduced  algorithmic  complexity  from 
a  practical  standpoint  when  memories  are  added  to  the  network  one  at  a  time. 

2  OUTER  PRODUCT  ALGORITHM 

Let $u^{(1)}, \ldots, u^{(m)}$ be a selection of m memories. The Outer Product Algorithm prescribes that the weight matrix be chosen in the following manner (see Chapter 1 for additional details):

$$W^{op} = UU^{T}, \qquad (2)$$

where $U = [u^{(1)}\ u^{(2)}\ \cdots\ u^{(m)}]$.

This scheme uses the sum of outer-products of the memory vectors as correlation between the memories to form a weight matrix so that an input vector close to a memory vector will pick out that memory vector. The functioning of the algorithm as an efficient associative memory has been well documented (cf. Hopfield, 1982), and theoretical results on the capacity have been derived (McEliece, Posner, Rodemich, and Venkatesh, 1987; Komlós and Paturi, 1988).³ We will now review some results for the scheme.

² On asymptotic notation. If {x(n)} and {y(n)} are any two sequences, we denote: x(n) = O(y(n)) if there exists a constant K such that |x(n)| ≤ K|y(n)| for every n; x(n) = o(y(n)) if |x(n)|/|y(n)| → 0 as n → ∞; x(n) ~ y(n) if x(n)/y(n) → 1 as n → ∞; and x(n) = ω(y(n)) if |x(n)|/|y(n)| → ∞ as n → ∞.
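A minimal sketch of the outer-product construction follows. It builds the matrix entry by entry as a sum of outer products of the memories; the optional zero-diagonal switch reflects a common variation and, like the function name and the two example memories, is an illustrative assumption.

```python
def outer_product_weights(memories, zero_diagonal=False):
    """W[i][j] = sum over memories of u_i * u_j, i.e. W = U U^T."""
    n = len(memories[0])
    W = [[sum(u[i] * u[j] for u in memories) for j in range(n)]
         for i in range(n)]
    if zero_diagonal:
        for i in range(n):
            W[i][i] = 0
    return W


# Two illustrative memories in {-1, 1}^4.
memories = [[1, -1, 1, -1],
            [1, 1, -1, -1]]
W = outer_product_weights(memories)
for row in W:
    print(row)
```

With the full diagonal kept, every diagonal entry equals the number of stored memories, and the matrix is symmetric by construction.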

2.1  Error  Correction 

From the definition it follows that $W^{op}$ is symmetric and non-negative definite. Therefore,
the algorithmic flow in state space is towards the minimisation of bounded functionals (the
Manhattan Norm $F$ in synchronous mode and the Energy $E$ in any mode). The trajectories
therefore will tend to terminate in stable states which are local minima of the functionals. If
these stable states correspond to the stored memories, the Outer Product Algorithm satisfies
the requirements of a physical associative memory. To examine its efficacy, however, we
need to estimate its storage capacity, and the algorithmic complexity of computing the
weights.

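We first consider the effect of $W^{op}$ on a memory $u^{(\alpha)}$. With the zero-diagonal form of the weights we have

$$
\begin{aligned}
\left(W^{op} u^{(\alpha)}\right)_i
  &= \sum_{j \ne i} w^{op}_{ij}\, u_j^{(\alpha)}
   = \sum_{j \ne i} \sum_{\beta=1}^{m} u_i^{(\beta)} u_j^{(\beta)} u_j^{(\alpha)} \\
  &= (n-1)\, u_i^{(\alpha)} + \sum_{j \ne i} \sum_{\beta \ne \alpha} u_i^{(\beta)} u_j^{(\beta)} u_j^{(\alpha)} \\
  &= (n-1)\, u_i^{(\alpha)} + \delta u_i^{(\alpha)} .
\end{aligned}
\tag{3}
$$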

We see that, in effect, there is a “signal” term and a “noise” term. Assuming that the
memories are chosen randomly from a sequence of symmetric Bernoulli trials, the noise
term has a mean of 0 and a standard deviation of $\sqrt{(n-1)(m-1)}$. The mean of the
absolute value of the signal term $(n-1)\,u_i^{(\alpha)}$ is $(n-1)$. Thus, if $m = o(n)$, the signal term
dominates the error term, and we can write

$$W^{op} U = (n-1)\,U + \delta U \approx (n-1)\,U.$$

It will follow that $\Delta(W^{op} U) = U$ with high probability if $m = o(n)$.

[Footnote to the capacity results cited at the start of Section 2: See also Chapter 9.]

[Footnote to the functionals $F$ and $E$ of Section 2.1: In some variations a zero diagonal is enforced for the matrix in (2). The matrix is then symmetric,
with non-negative diagonal elements so that the energy is non-increasing in asynchronous operation.]
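The signal and noise decomposition can be checked numerically. The following is a minimal sketch, assuming the zero-diagonal variant of the weights and thresholding by the sign function; the function name `outer_product_weights` and the sizes `n = 256`, `m = 4` are illustrative choices, not taken from the text.

```python
import numpy as np

def outer_product_weights(U, zero_diagonal=True):
    """Sum of outer products of the memory columns of U (shape n x m)."""
    W = (U @ U.T).astype(float)
    if zero_diagonal:
        # Zero-diagonal variant, which yields the (n - 1) signal term of (3).
        np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(0)
n, m = 256, 4                                  # illustrative sizes with m << n
U = 2 * rng.integers(0, 2, size=(n, m)) - 1    # random +/-1 memories

W = outer_product_weights(U)
field = W @ U                                  # column alpha is W u^(alpha)
signal = (n - 1) * U
noise = field - signal                         # the "noise" term of (3)

noise_std = noise.std()
predicted_std = np.sqrt((n - 1) * (m - 1))
stable = np.array_equal(np.where(field >= 0, 1, -1), U)
print(noise_std, predicted_std, stable)
```

With these sizes the empirical noise spread should sit near the predicted value and far below the signal magnitude $n - 1$, so every memory is a fixed point of the thresholded map.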

2.2  Capacity  and  Complexity 

The memories $u^{(\alpha)}$, $\alpha = 1, \ldots, m$, can be identified as pseudo-eigenvectors of the linear
operator $W^{op}$ with pseudo-eigenvalues $n - 1$. When randomly chosen, they are stable in
a probabilistic sense only if the ratio of the mean to the standard deviation, given by $\sqrt{(n-1)/(m-1)}$, is
large. More precisely, the following assertion holds (cf. McEliece, Posner, Rodemich, and
Venkatesh, 1987; Komlós and Paturi, 1988). Chapter 9 contains more details.

Proposition 3 The (stable state) capacity of the Outer Product Algorithm is $n/(4 \log n)$.

We  sketch  one  side  of  the  proof  in  Appendix  A  to  illustrate  some  of  the  ideas  involved.  The 
gentle  reader  is  also  invited  to  delve  into  Chapter  1  for  similar  derivations. 

Somewhat more can be shown than asserted above. In fact, if $\rho \in [0, 1/2)$ is any fixed
quantity, and random probes are generated at a distance $\rho n$ from each of the memories, then
all the errors in all the probes are corrected in one synchronous step with high probability
if the number of memories $m$ increases no more rapidly with $n$ than $(1 - 2\rho)^2\, n/(4 \log n)$. In
particular, this result implies that within capacity each of the memories has (asymptotically)
an identically sized ball of attraction of radius $\rho n$. From a physical viewpoint this implies
that all the memories are treated equivalently by the Outer Product Algorithm, as are all
the features (memory components).

If we label the number of elementary operations required to compute $W^{op}$ for $m$ memories
as $N^{op}$, then by counting the number of operations needed for matrix multiplication,
by considering that the weight matrix is symmetric, and by noting that the
diagonal elements are trivially specified, we find that $N^{op} = (mn^2 - mn)/2$. In addition, by
inspection of (3), we notice that the outer product matrix can be recursively computed via

$$W^{op}[\alpha] = W^{op}[\alpha - 1] + u^{(\alpha)} \left(u^{(\alpha)}\right)^T, \qquad \alpha \ge 1,$$

where $W^{op}[\alpha]$ denotes the outer-product weight matrix generated by the first $\alpha$ memories,
and $W^{op}[0] = 0$. This means that the incremental complexity is $N^{op}[\alpha] = n^2 - n$, and the
cost of computing the weight matrix incrementally is twice the cost of computing it from
scratch for any given set of $m$ memories.
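The equivalence of the batch and one-memory-at-a-time constructions can be illustrated as follows. This is a small sketch under the assumption that each update adds the plain outer product of the new memory with itself; the helper names and the tiny example matrix are hypothetical.

```python
import numpy as np

def outer_product_batch(U):
    """Weight matrix from all memories at once: U U^T."""
    return U @ U.T

def outer_product_incremental(U):
    """Same matrix, built by adding one memory (one rank-one term) at a time."""
    n, m = U.shape
    W = np.zeros((n, n), dtype=U.dtype)   # W[0] = 0
    for a in range(m):
        u = U[:, a]
        W += np.outer(u, u)               # W[a] = W[a-1] + u u^T
    return W

# Three memories in 4-space, stored as columns (illustrative values).
U = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1,  1,  1],
              [ 1,  1,  1]])

W_batch = outer_product_batch(U)
W_inc = outer_product_incremental(U)
print(np.array_equal(W_batch, W_inc))
```

The incremental form never revisits earlier memories, which is what makes it convenient when memories arrive one at a time.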

3  MEMORY  SELECTIVE  ALGORITHMS 

In this section, we will discuss schemes to generate the weight matrix to yield a larger
capacity than the outer product scheme. In addition, these schemes will enable us to
selectively increase attraction radii around specified memories. (Compare with the Outer
Product Algorithm, where uniform attraction balls around the memories obtain.) The
constructions extend the outer product scheme by making the memories true
eigenvectors of the linear operator $W$, and then specifying the eigenvalues of the memories.
Construction 1 Define the interconnection matrix $W^s$ as follows:

$$W^s = U \Lambda \left(U^T U\right)^{-1} U^T, \tag{4}$$

where $\Lambda = \mathrm{dg}[\lambda^{(1)}, \ldots, \lambda^{(m)}]$ is the $m \times m$ diagonal matrix of positive eigenvalues $\lambda^{(1)}, \ldots, \lambda^{(m)} > 0$, and $U = [\,u^{(1)} \cdots u^{(m)}\,]$ is the $n \times m$ matrix of memory column vectors.

Construction 2 Given $u^{(1)}, \ldots, u^{(m)}$, choose any $(n - m)$ vectors $u^{(m+1)}, \ldots, u^{(n)} \in \mathbb{B}^n$
such that $u^{(1)}, \ldots, u^{(n)}$ are linearly independent. Define the interconnection matrix $W^s$ as
follows:

$$W^s = U_a \Lambda_a U_a^{-1},$$

where the augmented matrices $U_a$ and $\Lambda_a$ are defined as

$$\Lambda_a = \mathrm{dg}[\lambda^{(1)}, \ldots, \lambda^{(m)}, 0, \ldots, 0], \qquad U_a = [\,u^{(1)} \cdots u^{(n)}\,].$$

We note that

$$W^s U = U \Lambda, \qquad W^s U_a = U_a \Lambda_a, \tag{5}$$

so that $u^{(1)}, \ldots, u^{(m)}$ are eigenvectors of $W^s$ and $\Lambda$ is the spectrum of $W^s$ (Personnaz,
Guyon, and Dreyfus, 1985; Venkatesh and Psaltis, 1989). Therefore, we are guaranteed to
have stable memories as long as $W^s$ is well defined.

Hybrids  of  the  two  methods  described  above  can  also  be  used  where  the  matrix  U  is 
partially  augmented  and  then  the  pseudo-inverse  is  computed. 
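As a concrete illustration of Construction 1, the sketch below builds the pseudo-inverse form of the weight matrix for three hand-picked, linearly independent memories in 8-space and checks the eigenvector property. The function name `spectral_weights`, the memory matrix, and the eigenvalues are assumptions made for the example only.

```python
import numpy as np

def spectral_weights(U, lam):
    """W = U diag(lam) (U^T U)^{-1} U^T for an n x m memory matrix U of full column rank."""
    U = np.asarray(U, dtype=float)
    gram_inv = np.linalg.inv(U.T @ U)      # requires linearly independent memories
    return U @ np.diag(lam) @ gram_inv @ U.T

# Columns are three linearly independent +/-1 memories (illustrative).
U = np.array([[1,  1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  1, -1],
              [1, -1,  1],
              [1, -1, -1],
              [1, -1,  1],
              [1, -1,  1]], dtype=float)
lam = np.array([3.0, 1.0, 2.0])            # one positive eigenvalue per memory

W = spectral_weights(U, lam)
eigen_ok = np.allclose(W @ U, U * lam)     # W u^(a) = lam^(a) u^(a), column by column
fixed_ok = np.array_equal(np.where(W @ U >= 0, 1.0, -1.0), U)
print(eigen_ok, fixed_ok)
```

Because every eigenvalue is positive, thresholding $W u^{(\alpha)}$ returns $u^{(\alpha)}$ itself, which is the stability guarantee stated above.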

3.1  Error  Correction 

For the case of an $m$-fold degenerate spectrum, $\lambda^{(1)} = \cdots = \lambda^{(m)} = \lambda > 0$, we see that the
matrix $W^s$ is symmetric with non-negative eigenvalues, i.e., it is non-negative definite.
Therefore there exist Lyapunov functions in this case. In fact, consider the energy function

$$E(u) = -\tfrac{1}{2} \langle u, W^s u \rangle .$$

For each memory, the energy is hence given by

$$E(u^{(\alpha)}) = -\tfrac{1}{2} \langle u^{(\alpha)}, W^s u^{(\alpha)} \rangle = -\frac{\lambda n}{2} .$$


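Let $u \in \mathbb{B}^n$ be arbitrary. We can write $u$ in the form

$$u = \sum_{\alpha=1}^{m} \gamma_\alpha u^{(\alpha)} + u_\perp = u_\parallel + u_\perp ,$$

where $u_\parallel$ is the projection of $u$ into the linear span of the $m$ memories $u^{(1)}, \ldots, u^{(m)}$, with real coefficients $\gamma_\alpha$,
and $u_\perp$ is a vector in the orthogonal subspace. Then we have $\langle u_\parallel, u_\perp \rangle = 0$, and by the
Pythagorean theorem,

$$\|u\|^2 = \|u_\parallel\|^2 + \|u_\perp\|^2 = \Big\| \sum_{\alpha=1}^{m} \gamma_\alpha u^{(\alpha)} \Big\|^2 + \|u_\perp\|^2 .$$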


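The energy is then given by

$$
E(u) = -\tfrac{1}{2} \langle u, W^s u \rangle
     = -\frac{\lambda}{2} \Big\langle \sum_{\alpha=1}^{m} \gamma_\alpha u^{(\alpha)}, \sum_{\alpha=1}^{m} \gamma_\alpha u^{(\alpha)} \Big\rangle
     = -\frac{\lambda}{2} \|u_\parallel\|^2
     \ge -\frac{\lambda}{2} \|u\|^2 = -\frac{\lambda n}{2} ,
$$

where the second equality uses $W^s u_\perp = 0$ and $W^s u_\parallel = \lambda u_\parallel$, and $\gamma_\alpha$ denotes the coefficients of the projection of $u$ onto the span of the memories.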


It  follows  that  the  stored  memories  form  global  Energy  minima. 

For the general Spectral matrix in (4), exact Lyapunov functions are hard to come by.
The signal-to-noise ratio, however, serves as a good ad hoc measure of attraction capability.
Consider synchronous operations with $W^s$ on a state vector $u = u^{(\alpha)} + \delta u \in \mathbb{B}^n$. We have

$$W^s u = W^s (u^{(\alpha)} + \delta u) = W^s u^{(\alpha)} + W^s \delta u .$$
Once again, there exists a “signal” term, $W^s u^{(\alpha)} = \lambda^{(\alpha)} u^{(\alpha)}$, and a “noise” term, $W^s \delta u$. We anticipate
that the greater the signal-to-noise ratio, the greater the attraction around $u^{(\alpha)}$. Let the
Hamming distance between $u$ and $u^{(\alpha)}$, $d_H(u, u^{(\alpha)})$, equal $d$, i.e., $\|\delta u\| = 2\sqrt{d}$. The (strong)
norm of the matrix $W^s$ is defined as

$$\|W^s\| = \sup_{x \ne 0} \frac{\|W^s x\|}{\|x\|} .$$

It follows (cf. Strang, 1980) that $\|W^s\| = \sqrt{k}$, where $k$ is the largest eigenvalue of the
matrix $(W^s)^T W^s$. For the case of the degenerate spectrum, $\lambda^{(1)} = \cdots = \lambda^{(m)} = \lambda > 0$, $W^s$ is
symmetric, and $(W^s)^T W^s = (W^s)^2$. Therefore, the maximum eigenvalue of $(W^s)^T W^s$ is
$k = \lambda^2$, and the signal-to-noise ratio (SNR) is given by

$$\frac{\|W^s u^{(\alpha)}\|}{\|W^s \delta u\|} \ge \frac{\lambda \sqrt{n}}{(\sqrt{k})(2\sqrt{d})} = \frac{1}{2} \sqrt{\frac{n}{d}} .$$

Thus, we would expect the attraction sphere around $u^{(1)}, \ldots, u^{(m)}$ to increase as $n$
increases for the $m$-fold degenerate Spectral Scheme. For the general non-degenerate case,
we expect that by varying the size of $\lambda^{(\alpha)}$, the SNR, and hence the attraction capability, can
be proportionately increased or decreased for the $\alpha$th memory $u^{(\alpha)}$ (Figure 1).
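The SNR bound for the degenerate spectrum can be checked on a small case. The sketch below assumes the degenerate weight matrix is $\lambda$ times the orthogonal projector onto the span of the memories; the memory matrix, $\lambda = 2$, and the choice of flipped components are illustrative assumptions.

```python
import numpy as np

# Three linearly independent +/-1 memories in 8-space, stored as columns (illustrative).
U = np.array([[1,  1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  1, -1],
              [1, -1,  1],
              [1, -1, -1],
              [1, -1,  1],
              [1, -1,  1]], dtype=float)
n, m = U.shape
lam = 2.0

# Degenerate spectrum: W = lam * U (U^T U)^{-1} U^T, a scaled orthogonal projector.
W = lam * U @ np.linalg.inv(U.T @ U) @ U.T
operator_norm = np.linalg.norm(W, 2)      # largest singular value, expected to equal lam

# Probe at Hamming distance d from the first memory: flip the first d components.
d = 2
memory = U[:, 0]
probe = memory.copy()
probe[:d] *= -1
delta = probe - memory                    # entries 0 or -/+2, so ||delta|| = 2 sqrt(d)

signal_norm = np.linalg.norm(W @ memory)  # equals lam * sqrt(n)
noise_norm = np.linalg.norm(W @ delta)    # at most lam * 2 sqrt(d)
bound = 0.5 * np.sqrt(n / d)
snr = signal_norm / noise_norm if noise_norm > 0 else np.inf
print(operator_norm, snr, bound)
```

The bound is generally loose, since the projector discards the part of $\delta u$ orthogonal to the memories.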

3.2  Capacity  and  Complexity 

To determine the capacity of these schemes, we use the following proposition (Komlós,
1967).

Proposition 4 Let $m$ increase with $n$ such that $m \le n$. Then the probability that a randomly
chosen set of $n$-tuples $u^{(1)}, \ldots, u^{(m)} \in \mathbb{B}^n$ is linearly independent approaches one as
$n \to \infty$.

It follows that the probability that $W^s$ is well defined approaches one as $n \to \infty$. Since a
linear transformation has at most $n$ eigenvalues, the static capacity of the spectral scheme
is $n$. (Dynamic capacities pose greater problems in evaluation here because of the added
dependency structure. For some theoretical results in this regard see Dembo (1989); for
numerical simulations see Venkatesh and Psaltis (1989) and Chapter 1 of this volume.)

Let $N^s$ denote the number of elementary operations required to compute the weight
matrix $W^s$ directly from the $m$ memories to be stored. Then using the fact that $(U^T U)^{-1}$
is symmetric, we can use the Cholesky decomposition to compute its inverse. This along
with the rest of the matrix multiplications yields $N^s = mn^2 + m^2 n + m^3/2 + O(n^2)$.

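When the eigenvalues are $m$-fold degenerate, Greville’s algorithm can be used to
recursively compute the pseudo-inverses, which in turn results in a recursive construction for
the weight matrices $W^s[\alpha]$. Here $W^s[\alpha]$ denotes the spectral weight matrix corresponding
to the first $\alpha$ memories. In fact, let $\lambda^{(\alpha)} = \lambda > 0$, $\alpha \ge 1$. For each $\alpha \ge 1$ let $e^{(\alpha)}$ be the
$n$-vector defined by

$$e^{(\alpha)} = \left(\lambda I - W^s[\alpha - 1]\right) u^{(\alpha)},$$

where we define $W^s[0] = 0$. Then it is easy to verify by induction that

$$W^s[\alpha] = W^s[\alpha - 1] + \frac{e^{(\alpha)} \left(e^{(\alpha)}\right)^T}{\langle u^{(\alpha)}, e^{(\alpha)} \rangle}, \qquad \alpha \ge 1. \tag{6}$$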

Now let $N^s[\alpha]$ denote the number of elementary operations needed to compute the update
of the weight matrix according to the recursion (6). Again counting the number of multiplications
(the number of additions is of the same order), we get the following cost estimate:
$N^s[\alpha] = 2n^2 + 2n$, $\alpha \ge 1$. Note that for all choices of $m \le n$, we have $m N^s[\alpha] \ge 2N^s$,
so that, especially for large $n$, the recursive construction of $W^s$ through the updates (6) is
computationally about twice as expensive as the direct estimation of $W^s$. Note that the
cost is only about four times more than that of the simple Outer Product Algorithm.
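The recursive construction can be compared numerically against the direct pseudo-inverse formula. The sketch below assumes the rank-one update $W[\alpha] = W[\alpha-1] + e e^T / \langle u, e \rangle$ with $e = (\lambda I - W[\alpha-1])\,u$; this form is an assumption used for illustration, and the function names and test memories are hypothetical.

```python
import numpy as np

def spectral_direct(U, lam):
    """Degenerate-spectrum weights computed in one shot: lam * U (U^T U)^{-1} U^T."""
    return lam * U @ np.linalg.inv(U.T @ U) @ U.T

def spectral_recursive(U, lam):
    """Same weights, built by one rank-one update per memory (assumed form of the recursion)."""
    n, m = U.shape
    W = np.zeros((n, n))                       # W[0] = 0
    for a in range(m):
        u = U[:, a]
        e = lam * u - W @ u                    # e = (lam I - W[a-1]) u
        W = W + np.outer(e, e) / (u @ e)       # u . e > 0 when u is outside the current span
    return W

# Three linearly independent +/-1 memories in 8-space (illustrative).
U = np.array([[1,  1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  1, -1],
              [1, -1,  1],
              [1, -1, -1],
              [1, -1,  1],
              [1, -1,  1]], dtype=float)
lam = 2.0

W_direct = spectral_direct(U, lam)
W_rec = spectral_recursive(U, lam)
print(np.allclose(W_direct, W_rec))
```

Each update costs one matrix-vector product and one outer product, which is the source of the roughly $2n^2$ multiplications per added memory.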

4  FEATURE  SELECTIVE  ALGORITHMS 

4.1  Dual  Spectral  Algorithm

The following scheme, formally related to the Outer Product and Spectral Algorithms, was
introduced by Maruani, Chevallier, and Sirat (1987). Let $U = [\,u^{(1)} \cdots u^{(m)}\,]$ be the
matrix of memories as before. Let $x^{(\beta)}$, $\beta = 1, \ldots, n - m$, be a set of linearly independent
vectors in $\mathbb{R}^n$ which are individually orthogonal to each of the memories, i.e., $X^T U = 0$,
where we define the $n \times (n - m)$ matrix $X = [\,x^{(1)} \cdots x^{(n-m)}\,]$. Define a weight matrix
$W$ with weights $w_{ij}$ given by

$$
w_{ij} =
\begin{cases}
-\sum_{\beta=1}^{n-m} x_{i\beta}\, x_{j\beta} & \text{if } i \ne j, \\
0 & \text{if } i = j,
\end{cases}
$$

where $x_{k\beta}$ is the $k$th component of $x^{(\beta)}$. If we define $\mu_i = \sum_{\beta=1}^{n-m} x_{i\beta}^2$, $i = 1, \ldots, n$, we see
that

$$W = M - X X^T, \tag{7}$$

where $M = \mathrm{dg}[\mu_1, \ldots, \mu_n]$. Thus,

$$W U = M U - X X^T U = M U. \tag{8}$$
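The construction can be sketched numerically. The example below assumes an orthonormal basis for the subspace orthogonal to the memories, obtained here from a full singular value decomposition (the text leaves the choice of basis open); the function names and the memory matrix are illustrative.

```python
import numpy as np

def orthogonal_basis(U):
    """n x (n - m) matrix X with X^T U = 0, assuming U has full column rank."""
    n, m = U.shape
    Q, _, _ = np.linalg.svd(U, full_matrices=True)
    return Q[:, m:]                      # left singular vectors outside the span of the memories

def dual_spectral_weights(U):
    """W = M - X X^T with M = dg[mu], mu_i = sum_beta x_{i beta}^2 (zero diagonal by construction)."""
    X = orthogonal_basis(U)
    mu = (X ** 2).sum(axis=1)
    return np.diag(mu) - X @ X.T, mu

# Three linearly independent +/-1 memories in 8-space (illustrative).
U = np.array([[1,  1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  1, -1],
              [1, -1,  1],
              [1, -1, -1],
              [1, -1,  1],
              [1, -1,  1]], dtype=float)

W, mu = dual_spectral_weights(U)
relation_ok = np.allclose(W @ U, mu[:, None] * U)          # W U = M U
fixed_ok = np.array_equal(np.where(W @ U >= 0, 1.0, -1.0), U)
print(relation_ok, fixed_ok, mu.round(3))
```

Each row of $WU$ is the corresponding row of $U$ scaled by a positive $\mu_i$, so thresholding leaves every memory unchanged.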




4.2  Error  Correction 

Comparing (5) and (8) we see that the Spectral and Dual Spectral Algorithms exhibit an
interesting duality.

In the Spectral Scheme, the eigenvectors of $W^s$ are the memories, so that the column
space of $W^s$ is given by the span of the memories. Therefore, if the memories are far enough
from each other and the initial state vector $u$ is close enough to a memory, $W^s$ combined
with the thresholding operation projects $u$ onto the memory.

In the Dual Spectral Scheme, since the parameters $\mu_i$ are positive for each choice of $i$,
it follows that

$$\Delta\!\left(W u^{(\alpha)}\right)_i = \mathrm{sgn}\!\left(\mu_i u_i^{(\alpha)}\right) = u_i^{(\alpha)}, \qquad \text{for each } i = 1, \ldots, n,\ \alpha = 1, \ldots, m.$$

So the memories $u^{(1)}, \ldots, u^{(m)}$ are fixed points in the scheme as well. $W$ as defined in (7)
is a zero-diagonal symmetric matrix. Thus, we know that there exist Lyapunov functions
in both modes of operation and that the network will exhibit some form of attraction
behaviour. The weight matrix is obtained by taking the correlation of vectors that are
orthogonal to the memories and then setting the diagonal elements to be 0. In creating the
zero diagonal, we essentially add perturbations to the left nullspace of $U$ in the directions
of the memories. The strength of the perturbations along any component, $i$, is proportional
to $\mu_i$. Thus, each of the $\mu_i$'s corresponds to a directional distortion, and we expect the SNR
of the Dual Spectral Scheme to vary from direction to direction proportionately with the
value of $\mu_i$. We therefore expect that the larger the $\mu_i$, the more the information that is
lost if the $i$th bit is flipped and, hence, the smaller the attraction in the $i$th direction.




As an illustration, let us consider the case where $n = 3$, and $\mu_y \gg \mu_x, \mu_z$. Each memory
$u$ would be preferentially attracted in the $x$ and $z$ directions, indicated schematically by an
attraction spheroid in Figure 2; i.e., a vector with a different $y$ component is less likely to
map back to $u$, but vectors with different $x$ and/or $z$ components will probably be within
the attraction region of $u$.

[Figure 2 goes here]

[The displayed relation that followed "In other words," is illegible in the source scan.]
4.3  Constructing  Feature  Selective  Weights


In the previous section, the orthogonal basis, $X$, was chosen arbitrarily and therefore
resulted in some lack of control in specifying attraction capability. As we argued above, the
$\mu_i$'s essentially control directional attraction and we have no means of specifying these under
the above approach. We now suggest a few schemes for constructing the weight matrices
that specify the $\mu$-values and thereby achieve direction-specific attraction. Specifically, for a
prescribed set $\mu_1, \ldots, \mu_n > 0$ of directional attraction strengths, and $M = \mathrm{dg}[\mu_1, \ldots, \mu_n]$,
we require a weight matrix $W^d$ such that

$$W^d U = M U. \tag{9}$$

We define $W^d$ such that:

$$
w^d_{ij} =
\begin{cases}
-\sum_{\beta=1}^{n-m} (x_{i\beta} b_\beta)(x_{j\beta} b_\beta) & \text{if } i \ne j, \\
0 & \text{if } i = j,
\end{cases}
\tag{10}
$$

where $x_{i\beta}$ is the $i$th component of the basis vector $x^{(\beta)}$ as defined earlier, and $b_\beta$ is the $\beta$th
component of a vector $b$ which we will specify shortly. Thus, given $\mu_1, \ldots, \mu_n$ we need to
find a vector $b$ such that, with $Y = X\,\mathrm{dg}[b_1, \ldots, b_{n-m}]$ (that is, $y_{i\beta} = x_{i\beta} b_\beta$),

$$W^d = M - Y Y^T. \tag{11}$$

(Note that the columns of $Y$, in general, are not orthogonal.)

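Assuming that $W^d$ has the form given in (10), let us now consider the effect of $W^d$ on
the $i$th element of a memory $u^{(\alpha)}$:

$$
\begin{aligned}
\left(W^d u^{(\alpha)}\right)_i
  &= \sum_{j=1}^{n} w^d_{ij}\, u_j^{(\alpha)}
   = -\sum_{j \ne i} \sum_{\beta=1}^{n-m} x_{i\beta}\, x_{j\beta}\, b_\beta^2\, u_j^{(\alpha)} \\
  &= -\sum_{j=1}^{n} \sum_{\beta=1}^{n-m} x_{i\beta}\, x_{j\beta}\, b_\beta^2\, u_j^{(\alpha)}
     + \sum_{\beta=1}^{n-m} x_{i\beta}^2\, b_\beta^2\, u_i^{(\alpha)} \\
  &= \sum_{\beta=1}^{n-m} x_{i\beta}^2\, b_\beta^2\, u_i^{(\alpha)} ,
\end{aligned}
$$

where the double sum in the second line vanishes because each $x^{(\beta)}$ is orthogonal to $u^{(\alpha)}$.
We require from (9) that $(W^d u^{(\alpha)})_i = \mu_i u_i^{(\alpha)}$, where $\mu_i > 0$. By inspection, we obtain the
relationship

$$\mu_i = \sum_{\beta=1}^{n-m} x_{i\beta}^2\, b_\beta^2, \qquad i = 1, \ldots, n. \tag{12}$$

Define $a_{i\beta} = x_{i\beta}^2$, and $c_\beta = b_\beta^2$. Then we require $A c = \mu$, where $A$ is a known $n \times (n - m)$
matrix with non-negative elements $a_{i\beta} = x_{i\beta}^2$, $c$ is an unknown $(n - m)$-dimensional vector
with $c_\beta = b_\beta^2$ constrained to be non-negative, and $\mu$ is a specified $n$-dimensional vector
with positive components $\mu_1, \ldots, \mu_n$.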
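The relationship between the scaling vector and the directional strengths can be illustrated by running the construction forwards: pick a non-negative $c$, form the scaled basis, and read off the resulting $\mu$. This is only a sketch; the basis is taken from a singular value decomposition, and the particular values of `c` are arbitrary assumptions rather than the output of any optimisation.

```python
import numpy as np

# Three linearly independent +/-1 memories in 8-space (illustrative).
U = np.array([[1,  1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  1, -1],
              [1, -1,  1],
              [1, -1, -1],
              [1, -1,  1],
              [1, -1,  1]], dtype=float)
n, m = U.shape

# Basis X for the subspace orthogonal to the memories (X^T U = 0).
Q, _, _ = np.linalg.svd(U, full_matrices=True)
X = Q[:, m:]

c = np.linspace(0.5, 2.0, n - m)   # hypothetical non-negative c_beta = b_beta^2
b = np.sqrt(c)
Y = X * b                          # column beta of X scaled by b_beta
A = X ** 2                         # a_{i beta} = x_{i beta}^2
mu = A @ c                         # directional strengths implied by c

W_d = np.diag(mu) - Y @ Y.T        # W^d = M - Y Y^T
print(np.allclose(W_d @ U, mu[:, None] * U), mu.round(3))
```

The design problem in the text runs the other way: $\mu$ is prescribed and a non-negative $c$ with $Ac = \mu$ is sought, which is why an optimisation step is needed.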

We notice that this is an overspecified system of $n$ equations with $(n - m)$ unknowns,
where both $c$ and $\mu$ are constrained to have non-negative elements. Linear programming
techniques can be used to solve this system of equations. We can choose the $\mu$-values in a
variety of ways. A few representative methods are:

1. Specify $\mu_1, \ldots, \mu_k$, $k \le n - m$, let $\mu_{k+1}, \ldots, \mu_n \le \epsilon$, and minimise $\epsilon$.

2. Specify $\mu_1, \ldots, \mu_n$, and

(a) Minimise the mean-square error given by

$$\|A c - \mu\|^2 = \sum_{i=1}^{n} \left(a_{i,1} c_1 + \cdots + a_{i,n-m} c_{n-m} - \mu_i\right)^2$$

subject to the constraints $\mu > 0$, $c \ge 0$.

(b) Minimise the largest absolute error, $\epsilon_0$, given by

$$\epsilon_0 = \max(|\epsilon_1|, \ldots, |\epsilon_n|),$$

where $\epsilon_i$, the error in $\mu_i$, is

$$\epsilon_i = \mu_i - \left(a_{i,1} c_1 + \cdots + a_{i,n-m} c_{n-m}\right), \qquad i = 1, \ldots, n.$$

For simplicity, we consider algorithms employing the first linear programming approach
outlined above. We have modified the initial basis for the nullspace of $U$ using the results
of the simplex method such that

$$W^d = M - Y Y^T,$$

where $M = \mathrm{dg}[\mu_1, \mu_2, \ldots, \mu_n]$ with $\mu_1, \ldots, \mu_k > 0$ specified by us, $0 < \mu_{k+1}, \ldots, \mu_n \le \epsilon < \min(\mu_1, \ldots, \mu_k)$,
and $Y = X\,\mathrm{dg}[b]$ is a set of basis vectors for the left nullspace of $U$. Since
$\mu_i$, $i = 1, \ldots, n$, are positive, we see that all the memories are strictly stable in the Dual
Spectral Scheme as long as the memories, $u^{(1)}, \ldots, u^{(m)}$, are linearly independent, and we
are able to find the vector $c$ in the system (12) through linear programming. In Appendix B
we outline the linear programming approach in some detail.
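One way to pose the first approach as a linear programme is sketched below. This is an illustrative formulation under stated assumptions, not necessarily the one detailed in Appendix B: the small constant $\eta > 0$ is a hypothetical device for enforcing strict positivity of the unspecified $\mu_i$, and the specified directions are taken to be the first $k$ components.

```latex
\begin{aligned}
\text{minimise}\quad   & \epsilon \\
\text{subject to}\quad & \sum_{\beta=1}^{n-m} a_{i\beta}\, c_\beta = \mu_i,
                         && i = 1, \ldots, k \quad \text{(specified directions)}, \\
                       & \eta \le \sum_{\beta=1}^{n-m} a_{i\beta}\, c_\beta \le \epsilon,
                         && i = k+1, \ldots, n \quad \text{(unspecified directions)}, \\
                       & c_\beta \ge 0,
                         && \beta = 1, \ldots, n-m .
\end{aligned}
```

The unknowns are $c_1, \ldots, c_{n-m}$ and $\epsilon$; once a feasible optimum is found, the unspecified strengths are read off as $\mu_i = \sum_\beta a_{i\beta} c_\beta$ for $i > k$ and the scaling vector is recovered from $b_\beta = \sqrt{c_\beta}$.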

4.4  Capacity  and  Complexity 

By Komlós’ result (Proposition 4), we are guaranteed that almost all choices of $n$ memories
or fewer are linearly independent, so that for almost all choices of $n - 1$ memories there
is an orthogonal subspace of dimension 1, while almost all choices of $n$ memories span the
space $\mathbb{R}^n$ and therefore the orthogonal subspace is of dimension 0. The storage capacity of
the Dual Spectral Scheme of (10) is therefore $n - 1$, since $n - 1$ is the largest number of memories
for which we can still specify a left nullspace $X$.

To find an $n$-dimensional vector under constraints, the Simplex Method iterates from
one feasible solution to another until it finds an optimal feasible solution. In the worst case,
we need to test each vertex of the feasible region, which is an $n$-sided polyhedron, leading
to $2^n - 1$ iterations. However, such cases are rare and require careful specification of the
constraints designed solely to approach the worst case behaviour. In practice, it has been
widely reported (Chvátal, 1983; Murty, 1983) that the number of iterations is almost always
between 1 and 3 times the number of constraints. Thus, for the case of specifying $k$ values of
$\mu$ we would expect at the most $3n$ iterations. For the (Revised) Simplex Method, a good
estimate of the average cost of each iteration in our scheme is $52n - 10m - 10k + 10$, while for
the Standard Simplex Method, a good estimate is $(2n^2 - mn - kn + n)/4$ (cf. Chvátal, 1983,
p. 113). Thus, we estimate that the total cost of specifying $k$ values of $\mu$ is $O(n^2)$ (using
the Revised Simplex Method). The cost of finding a basis for the nullspace of $U$ (through
Gram-Schmidt orthogonalisation) includes finding $(U^T U)^{-1}$ and two matrix multiplications
and is given by $mn^2 + (m^2 n)/2 - m^3/2 + O(n^2)$. Finally, the cost of finding $W^d$ from $c$
and $X$ is $n^3 - n^2 m + O(n^2)$. So, we can say that on the average,

$$N^d = n^3 + \frac{m^2 n}{2} - \frac{m^3}{2} + O(n^2),$$

where $N^d$ is the number of elementary operations needed to compute $W^d$.

5  COMPOSITE  ALGORITHMS 

In Section 3 we saw ways of increasing the radii of attraction-spheres around memories. In
Section 4.3 we saw ways of specifying increased attraction in certain directions around each
of the memories. A natural extension of these schemes is to create a Composite Scheme
(Venkatesh, Pancha, Psaltis, and Sirat, 1990) with weight matrix given by

$$W^c = W^s + W^d.$$

Since $W^c$ is a linear combination of $W^s$ and $W^d$, we would expect memories to be stable
in the Composite Scheme for reasons described in the previous sections. The idea of
the Composite Scheme is to specify both memory-specific attraction by specifying $\lambda$ for
each memory, and direction-specific attraction by specifying $\mu$ for the individual directions
(Figure 3).

Here, the spectrum of $W^s$ is no longer degenerate, and $W^c$, consequently, is no longer
symmetric. As the Composite Algorithm combines the memory-specific Spectral Algorithm
and the direction-specific Dual Spectral Algorithm, it works effectively in shaping the
attraction regions as desired. It should be noted that the relative values of the $\lambda^{(1)}, \ldots, \lambda^{(m)}$,
compared to the $\mu_1, \ldots, \mu_n$, need to be considered in order not to lose the effects of one of
the two parts of the composite scheme.

[Figure 3 goes here]

Note that the capacity of the Composite Scheme is $n - 1$. The algorithmic complexity
of the Composite Scheme is the sum of the complexities of the Spectral and Dual Spectral
Schemes, except that we need not find $(U^T U)^{-1}$ twice. Therefore the complexity, $N^c$, is
given by $3n^3 + O(n^2)$ for $m \approx n$.
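The stability of the memories under the composite weights can be illustrated with a short sketch. It assumes the pseudo-inverse form of the spectral part and an orthonormal basis (from a singular value decomposition) for the dual part; the function names, memory matrix, and eigenvalues are illustrative assumptions.

```python
import numpy as np

def spectral_weights(U, lam):
    """Spectral part: U diag(lam) (U^T U)^{-1} U^T."""
    return U @ np.diag(lam) @ np.linalg.inv(U.T @ U) @ U.T

def dual_spectral_weights(U):
    """Dual part: dg[mu] - X X^T for an orthonormal basis X of the complement of span(U)."""
    n, m = U.shape
    Q, _, _ = np.linalg.svd(U, full_matrices=True)
    X = Q[:, m:]
    mu = (X ** 2).sum(axis=1)
    return np.diag(mu) - X @ X.T, mu

# Three linearly independent +/-1 memories in 8-space (illustrative).
U = np.array([[1,  1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  1, -1],
              [1, -1,  1],
              [1, -1, -1],
              [1, -1,  1],
              [1, -1,  1]], dtype=float)
lam = np.array([3.0, 1.0, 2.0])            # non-degenerate, memory-specific eigenvalues

W_s = spectral_weights(U, lam)
W_d, mu = dual_spectral_weights(U)
W_c = W_s + W_d                            # composite weights

field = W_c @ U
expected = U * lam + mu[:, None] * U       # U Lambda + M U
stable = np.array_equal(np.where(field >= 0, 1.0, -1.0), U)
print(np.allclose(field, expected), stable, np.allclose(W_c, W_c.T))
```

Component $i$ of memory $\alpha$ is scaled by $\lambda^{(\alpha)} + \mu_i > 0$, so the memories remain fixed points even though the composite matrix is in general not symmetric.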

6  SIMULATIONS 

There are a number of open questions involved with the schemes outlined above. In this
section, we discuss trends observed in computer simulations that validate some of our conjectures.
All memories were chosen randomly using a binomial pseudo-random generator. Test
input vectors at specified Hamming distances from the memories were generated by reversing
the signs of randomly chosen components for the Outer-Product and Spectral schemes.

In the case of the Dual Spectral schemes, test input vectors were generated by reversing the
signs of randomly chosen components with the specified or unspecified $\mu$ values.
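The probe generation described above can be sketched as follows. This is not the original simulation code: the helper `make_probe`, the seed, and the sizes are hypothetical, and the recovery check uses the zero-diagonal outer product weights purely as an example.

```python
import numpy as np

def make_probe(memory, d, rng):
    """Copy of `memory` with the signs of d randomly chosen components reversed."""
    probe = memory.copy()
    flipped = rng.choice(memory.size, size=d, replace=False)
    probe[flipped] *= -1
    return probe

rng = np.random.default_rng(0)
n, m, d = 256, 4, 10                           # illustrative sizes
U = 2 * rng.integers(0, 2, size=(n, m)) - 1    # random +/-1 memories as columns

W = (U @ U.T).astype(float)
np.fill_diagonal(W, 0.0)                       # zero-diagonal outer product weights

memory = U[:, 0]
probe = make_probe(memory, d, rng)
hamming = int(np.sum(probe != memory))

# One synchronous step of the thresholded dynamics.
recovered = np.where(W @ probe >= 0, 1, -1)
print(hamming, np.array_equal(recovered, memory))
```

Repeating this over many probes and increasing `d` until recovery fails gives an empirical estimate of the attraction radius.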

Analytical bounds are difficult to arrive at for the attraction radius as a function of
the number of memories and the dimension of the state space. Figure 4 plots the attraction
radius for the Spectral Scheme. The attraction radius was estimated by averaging the maximum
Hamming distance of error-correction around stable memories over several independent
runs. The memories were divided into two equal sized groups, one group with eigenvalue
$\lambda(\text{large}) = 3n$, and the other group with eigenvalue $\lambda(\text{small}) = n$. The respective attraction
radii of the $\lambda(\text{large})$ memories and the $\lambda(\text{small})$ memories are plotted against the number
of memories $m$.

[Figure 4 goes here]

In the Dual Spectral Schemes, there is the question of the number of directions, $k$,
that can be specified given a set of $m$ memories and $n$ neurons, arising from the nature of
the construction of the matrix $W^d$. It is obvious from the previous discussions about the
dimensions of $A$ and $c$ that we can surely specify no more than $n - m$ directions. However,
there is a possibility (albeit small) that there exist no feasible solutions for pathological
cases where $k < n - m$. Another quantity we are interested in is the size of $\epsilon$, the largest
of the unspecified $\mu$'s, compared to the size of the specified $\mu$'s, since we have conjectured
that this will affect directional attraction.

While there exists little theory for the Simplex Method which will enable us to gauge
these parameters, simulations show that $\epsilon$ is typically small (Figure 5) compared to $\mu_i$
for the specified directions ($< 0.5\,\mu_i$), and $k$ is typically of the order of $n/4$ in the ranges
simulated. We conjecture that this behaviour continues to hold for large $n$. Figures 6(a)
and 6(b) plot directional attraction for the Dual Spectral Scheme. Attraction data for a
given direction were generated by investigating probe vectors at various Hamming distances
from a memory with the component of the probe in the direction being investigated being
chosen to be opposite in sign to the corresponding component of the memory. Flipping a
bit in an important (large $\mu$) direction almost always reduced the attraction to the memory
compared to an unimportant (small $\mu$) direction.

[Figure 5 goes here]

[Figures 6(a) and 6(b) go here]

Figures 7(a) and 7(b) plot the results for the composite scheme. The memories were
divided into two groups, one group corresponding to a “large” eigenvalue, $\lambda_{lg} = 3$, and
the other group corresponding to a “small” eigenvalue, $\lambda_{sm} = 1$. Directional attraction
parameters for $k$ directions were set equal to $\mu_{lg} = 16$. Attraction radii were determined
for the large $\mu$ (specified) and small $\mu$ (unspecified) directions as a function of $k$, for both
the large eigenvalue memories and the small eigenvalue memories. As can be seen, there
is degradation of the behaviour when compared to the memory-specific and feature-specific
schemes. However, the attraction behaviour is in keeping with our expectations, and we
conjecture that the behaviour of these networks improves as $n$ increases.

[Figures 7(a) and 7(b) go here]

7  SUMMARY 

The recurrent neural network paradigm for associative memory is attractive in its simplicity
and computational tractability. The classical Outer Product Algorithm, for instance, has
very low implementation complexity and yet exhibits near-linear memory storage capacities
with correction of a linear number of random errors uniformly in balls around the memories.
In this chapter we have shown how it is possible to exercise a macroscopic degree of nonuniform
error correction around the stored memories. In particular, both memory and feature
selective error correction is feasible using a composite algorithm which exploits the spectral
characteristics of the interconnectivity weight matrix. These spectral based approaches
are near-optimal in character and, in particular, are characterised by low implementation
complexities and linear storage capacities. Extensions of these approaches are possible to
higher-order neural networks (where the model neurons compute the sign of polynomial
forms of their inputs) with concomitant increases in the storage capacities of the networks
(cf. Venkatesh and Baldi, 1991a, 1991b).

APPENDIX  A:  PROPOSITIONS 

Proposition 1 $E(u)$ is non-increasing in asynchronous mode if $W$ is symmetric and has
non-negative diagonal elements; $E(u)$ is non-increasing in any mode if $W$ is symmetric
and non-negative definite.

Proof: (a) Consider either a synchronous or asynchronous mode of operation. For any
state $u$, the algorithm results in a flow in state space defined by $u \mapsto u + \delta u$. We note that
$\delta u$ is an $n$-tuple with each component taking on one of the values 0, $-2$, or 2. The change
in $E$ is given by

$$
\begin{aligned}
\delta E &= E(u + \delta u) - E(u) \\
         &= -\tfrac{1}{2} \left[ \langle \delta u, W u \rangle + \langle u, W \delta u \rangle + \langle \delta u, W \delta u \rangle \right] .
\end{aligned}
$$

Since $W$ is symmetric, $\langle \delta u, W u \rangle = \langle u, W \delta u \rangle$. Hence, we have

$$\delta E = -\langle \delta u, W u \rangle - \tfrac{1}{2} \langle \delta u, W \delta u \rangle .$$

We note that the nature of the algorithm is such that the sign of each component of $\delta u$
is the same as that of the corresponding component of $W u$. Thus, the inner product
$\langle \delta u, W u \rangle \ge 0$ for every state vector $u \in \mathbb{B}^n$. Furthermore, if $W$ is non-negative definite,
the quadratic form $\langle \delta u, W \delta u \rangle \ge 0$. Thus $\delta E \le 0$ and $E$ is a monotone non-increasing
function.
(b) In the asynchronous mode, assume that the $k$th neuron updates itself at epoch $t$.
We therefore have

$$
\delta u_i =
\begin{cases}
u_k[t+1] - u_k[t] & \text{if } i = k, \\
0 & \text{if } i \ne k .
\end{cases}
$$

The change in energy $\delta E$ is given by

$$\delta E = -\delta u_k \sum_{j=1}^{n} w_{kj} u_j - \tfrac{1}{2} (\delta u_k)^2 w_{kk} .$$

Once again, the algorithm ensures that $\delta u_k$ is of the same sign as $\sum_{j=1}^{n} w_{kj} u_j$, and since
$w_{kk}$ is non-negative, we see that $\delta E \le 0$, and $E$ is monotone non-increasing. ∎
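The asynchronous case can be checked numerically. The sketch below assumes the energy $E(u) = -\tfrac{1}{2}\langle u, Wu \rangle$, a random symmetric weight matrix with zero diagonal, and the convention that a zero field maps to $+1$; all names and sizes are illustrative.

```python
import numpy as np

def energy(W, u):
    """E(u) = -1/2 <u, W u>."""
    return -0.5 * u @ W @ u

def async_sweep(W, u, rng):
    """Update the neurons one at a time in random order, recording the energy after each update."""
    u = u.copy()
    energies = [energy(W, u)]
    for k in rng.permutation(u.size):
        u[k] = 1.0 if W[k] @ u >= 0 else -1.0   # threshold the k-th field
        energies.append(energy(W, u))
    return u, energies

rng = np.random.default_rng(1)
n = 12
A = rng.standard_normal((n, n))
W = A + A.T                      # symmetric weights
np.fill_diagonal(W, 0.0)         # non-negative (here zero) diagonal

u0 = rng.choice([-1.0, 1.0], size=n)
u1, energies = async_sweep(W, u0, rng)
steps = np.diff(energies)
print(steps.max() <= 1e-9)
```

Every single-neuron update either leaves the energy unchanged or lowers it, in line with part (b) of the proof.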


Proposition 2 $F(u)$ is non-increasing in synchronous mode if $W$ is symmetric.

Proof: In the synchronous mode of operation the change in the Manhattan Norm is given
by

$$
\begin{aligned}
\delta F &= F(u[t+1]) - F(u[t]) \\
         &= -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} u_i[t+1]\, w_{ij}\, u_j[t]
            + \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} u_p[t]\, w_{pq}\, u_q[t-1] .
\end{aligned}
$$

As $W$ is symmetric, we have $w_{pq} = w_{qp}$, so that the second double sum can be rewritten as $\sum_{i} \sum_{j} u_i[t-1]\, w_{ij}\, u_j[t]$. So

$$\delta F = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( u_i[t+1] - u_i[t-1] \right) w_{ij}\, u_j[t] .$$

Let $I$ be the set of indices for which $u_i[t+1] = -u_i[t-1]$. (Note that $I$ can be empty.) Then

$$\delta F = -\sum_{i \in I} u_i[t+1] \sum_{j=1}^{n} w_{ij}\, u_j[t] .$$

Once again we note that the nature of the algorithm guarantees us that $u_i[t+1]$ and $\sum_{j} w_{ij} u_j[t]$
are of the same sign. Thus,

$$\delta F = -\sum_{i \in I} \Big| \sum_{j=1}^{n} w_{ij}\, u_j[t] \Big| \le 0,$$

and $F$ is a monotone non-increasing function along trajectories in state space. ∎
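The synchronous case can be checked in the same spirit. The sketch below takes the functional to be $-\|W u\|_1$, which is an assumption about its normalisation (a constant factor does not affect monotonicity), and uses a random symmetric weight matrix; names and sizes are illustrative.

```python
import numpy as np

def sync_step(W, u):
    """One synchronous update: every neuron thresholds its field at once."""
    return np.where(W @ u >= 0, 1.0, -1.0)

def manhattan(W, u):
    """Assumed form of the functional: minus the 1-norm of the field W u."""
    return -np.abs(W @ u).sum()

rng = np.random.default_rng(2)
n = 16
A = rng.standard_normal((n, n))
W = A + A.T                          # symmetry is the only property required

u = rng.choice([-1.0, 1.0], size=n)
values = []
for _ in range(20):
    values.append(manhattan(W, u))
    u = sync_step(W, u)

steps = np.diff(values)
print(steps.max() <= 1e-9)
```

In synchronous mode the trajectory may settle into a cycle of length two rather than a fixed point, but the functional still never increases along it.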


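Proposition 3 The (stable state) capacity of the Outer Product Algorithm is $n/(4 \log n)$.

Proof: We will prove only that $n/(4 \log n)$ is a lower bound for the stable state capacity of
the Outer Product Algorithm. Consider the random sums

$$
X_n^{(i,\alpha)} = u_i^{(\alpha)} \sum_{j=1}^{n} w^{op}_{ij}\, u_j^{(\alpha)}
 = n + m - 1 + \sum_{j \ne i} \sum_{\beta \ne \alpha} u_i^{(\alpha)} u_i^{(\beta)} u_j^{(\beta)} u_j^{(\alpha)},
 \qquad 1 \le i \le n,\ 1 \le \alpha \le m.
$$

Fix $i$ and $\alpha$, and write simply $X_n$ instead of $X_n^{(i,\alpha)}$. We can now write

$$X_n = n + m - 1 + \sum_{\beta \ne \alpha} \sum_{j \ne i} Z_j^{(\beta)},$$

where, for fixed $i$ and $\alpha$, we define the random variables $Z_j^{(\beta)} = u_i^{(\alpha)} u_i^{(\beta)} u_j^{(\beta)} u_j^{(\alpha)}$. Note
that the random variables $Z_j^{(\beta)}$, $j \ne i$, $\beta \ne \alpha$, are i.i.d., symmetric, $\pm 1$ random variables.
Now recall that a simple application of Chebyshev’s inequality yields that for any random
variable $Y$ and any $t \ge 0$,

$$P\{Y \le -t\} \le \inf_{r \ge 0} e^{-rt}\, E\!\left(e^{-rY}\right).$$

[Footnote: The critical fact here is that each random variable $Z_j^{(\beta)}$ has a distinct multiplicative term $u_j^{(\beta)}$ which
occurs solely in the expression for $Z_j^{(\beta)}$.]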
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Applying  this  result  here  we  obtain 


P{X„  <0}  =  ^  <  -n  -  m  +  1 


l/S/or 


<  inf  E|e"’‘  | 

=  i,  Ejnne-<’). 


The  terms  in  the  product,  6“’^'^  ,  (3  a,  j  i  are  independent  random  variables  as  the 
random  variables  are  independent.  The  expectation  of  the  product  of  random  variables 
above  can,  hence,  be  replaced  by  the  product  of  expectations.  Accordingly,  denoting  by  Z 
a  random  variable  which  takes  on  values  -1  and  1  only,  each  with  probabilty  1/2,  we  have 

PlATn  <  0}  <  inf  =  inf  (cosh 


Now,  for  every  r  €  IR  we  have  coshr  <  Hence 


<  0}  <  inf  exp^ 


r2(m- l)(n- 1)  ,  .  f  {n  +  m  -  1)^ 

- 2 - ’^<"'^"-l))=M-2(n---T)(m:-l) 


We  hence  obtain  that  the  probability  that  any  given  component  of  a  memory  is  not  stable 


is  bounded  by 


p{j5^(X.a)  <  0}  <  e-"/*”* 


for  large  enough  n  provided  m  grows  so  that  m  =  o(n)  and  m  =  u{y/n). 


Denote  the  event 


^^n=(U 

l.=:lor=I  J 
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that  one  or  more  of  the  nm  memory  components  is  not  stable.  A  simple  application  of 
Boole’s  inequality  now  yields 

P{^n}  <  ^  <  0}  <  nm  exp|-  | . 

t  =  l  0  =  1 

For  a  choice  of 

log  log  n  + log  4c  _  /loglognNI 
21ogn  V(logn)2yJ 

we  hence  obtain  that  P{^^n}  <  c  as  n  — ♦  oo.  As  the  probability  that  each  of  the  memo¬ 
ries  is  stable  is  exactly  1  —  P{£^n}5  this  establishes  that  the  capacity  is  at  least  n/4  log  n.  I 
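The bounds in this proof are easy to evaluate numerically. The sketch below is illustrative only: the helper names are hypothetical, the weight matrix is assumed to be the plain sum of outer products with the diagonal retained (which is what the constant $n + m - 1$ in $X_n$ suggests), and the Monte Carlo check simply tests whether every $X_n^{(i,\alpha)}$ is positive for one random draw of memories.

```python
import math
import random

def component_bound(n, m):
    """exp(-(n+m-1)^2 / (2(n-1)(m-1))): bound on P{X_n <= 0} for one component."""
    return math.exp(-((n + m - 1) ** 2) / (2.0 * (n - 1) * (m - 1)))

def union_bound(n, m):
    """Boole's inequality over all n*m memory components."""
    return n * m * component_bound(n, m)

def all_memories_stable(n, m, rng):
    """Draw m random +-1 memories, form W as the sum of outer products
    (diagonal kept, an assumption), and report whether every X > 0."""
    mem = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
    W = [[sum(v[i] * v[j] for v in mem) for j in range(n)] for i in range(n)]
    for v in mem:
        for i in range(n):
            x = v[i] * sum(W[i][j] * v[j] for j in range(n))
            if x <= 0:
                return False
    return True
```

For example, `union_bound(n, int(n / (4 * math.log(n))))` can be tabulated for increasing `n` to see how the bound behaves near the claimed capacity, and `all_memories_stable` gives a direct empirical check for small `n` and `m`.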


Proposition 4 Let $m$ increase with $n$ such that $m \le n$. Then the probability that a randomly
chosen set of $m$ $n$-tuples $u^{(1)}, \ldots, u^{(m)} \in \mathbb{B}^n$ is linearly independent approaches one as
$n \to \infty$.

This result follows directly from a result of Komlós (1967) asserting that the probability
that a random $n \times n$ $\pm 1$ matrix is nonsingular approaches one as $n \to \infty$.

APPENDIX  B:  LINEAR  PROGRAMMING 

1. Specify $\mu_1, \ldots, \mu_k$, $k \le n - m$, let $\mu_{k+1}, \ldots, \mu_n \le \epsilon$, and minimise $\epsilon$. The
canonical form of the linear programming problem that the Simplex Method solves is:
Minimise the goal function $\mathbf{c}^T \mathbf{y}$ subject to the constraints

$$A\mathbf{y} = \mathbf{b},$$

where the vector $\mathbf{y}$ is unknown, and $\mathbf{y} \ge 0$.

In this case, we specify $k$ positive values of $\mu$ and minimise the maximum of the
$(n - k)$ unspecified values of $\mu$ subject to the constraints $\mu_{k+1}, \ldots, \mu_n \ge 0$, and
$c_1, \ldots, c_{n-m} \ge 0$. In other words, we have the following equations

$$\begin{aligned}
a_{1,1} c_1 + \cdots + a_{1,n-m} c_{n-m} &= \mu_1 \\
&\ \ \vdots \\
a_{k,1} c_1 + \cdots + a_{k,n-m} c_{n-m} &= \mu_k \\
a_{k+1,1} c_1 + \cdots + a_{k+1,n-m} c_{n-m} &\le \epsilon \\
&\ \ \vdots \\
a_{n,1} c_1 + \cdots + a_{n,n-m} c_{n-m} &\le \epsilon
\end{aligned}$$

where $c_i \ge 0$, $\epsilon \ge 0$, and we want to find $\mathbf{c}$ which minimises $\epsilon$.

To convert the $n - k$ inequalities to equalities, we subtract $\epsilon$ from both sides of the
equation and add slack variables $z_1, \ldots, z_{n-k}$ to give us the following $n - k$ equations

$$\begin{aligned}
a_{k+1,1} c_1 + \cdots + a_{k+1,n-m} c_{n-m} - \epsilon + z_1 &= 0 \\
&\ \ \vdots \\
a_{n,1} c_1 + \cdots + a_{n,n-m} c_{n-m} - \epsilon + z_{n-k} &= 0,
\end{aligned}$$

in addition to the first $k$ equations. Now we have $n$ equations with $2n - m - k$
unknown non-negative quantities $(c_1, \ldots, c_{n-m}, z_1, \ldots, z_{n-k})$. Let us label $\epsilon$ as $c_0$.

By inspection, we see that the goal function to be minimised is $c_0$, subject to the
constraints $A'\mathbf{c}' = \boldsymbol{\mu}'$ where $\mathbf{c}'$ is a $(2n - m - k + 1)$-dimensional vector, $\boldsymbol{\mu}'$ is an
$n$-dimensional vector, and $A'$ is an $n \times (2n - m - k + 1)$ matrix; i.e., we require to
solve

$$A' \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-m} \\ z_1 \\ \vdots \\ z_{n-k} \end{pmatrix} = \boldsymbol{\mu}',$$

and $c_i, z_i \ge 0$. This is in the canonical form for the Simplex Method.
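The construction of $A'$ and $\boldsymbol{\mu}'$ from the $n$ equations above can be sketched in code. This is an illustrative assembly only, not a solver: the function name `canonical_form` and the ordering of the unknowns as $(c_0, c_1, \ldots, c_{n-m}, z_1, \ldots, z_{n-k})$ are assumptions, and the first $k$ rows are taken to be the equality constraints.

```python
def canonical_form(A, mu_spec):
    """Assemble (A_prime, rhs, cost) for the min-max problem in equality form.

    A        : n x q coefficient rows (q = n - m), as a list of lists
    mu_spec  : the k specified values mu_1..mu_k (for the first k rows)
    Unknowns are ordered (c0, c_1..c_q, z_1..z_{n-k}), where c0 plays the
    role of epsilon and the z's are slack variables.
    """
    n, q, k = len(A), len(A[0]), len(mu_spec)
    A_prime, rhs = [], []
    for i, row in enumerate(A):
        slack = [0.0] * (n - k)
        if i < k:
            # equality row: a_i . c = mu_i (no c0, no slack)
            A_prime.append([0.0] + list(row) + slack)
            rhs.append(mu_spec[i])
        else:
            # inequality row a_i . c <= c0 becomes a_i . c - c0 + z = 0
            slack[i - k] = 1.0
            A_prime.append([-1.0] + list(row) + slack)
            rhs.append(0.0)
    cost = [1.0] + [0.0] * (q + n - k)   # minimise c0 only
    return A_prime, rhs, cost
```

The resulting triple has the shape described in the text: $n$ rows and $2n - m - k + 1$ columns, and it can be handed to any simplex implementation that accepts equality constraints with non-negative unknowns.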

2. Specify $\mu_1, \ldots, \mu_n$, and

(a) Minimise the mean-square error given by

$$\|A\mathbf{c} - \boldsymbol{\mu}\|^2 = \sum_{i=1}^{n} \bigl(a_{i,1} c_1 + \cdots + a_{i,n-m} c_{n-m} - \mu_i\bigr)^2$$

subject to the constraints $\boldsymbol{\mu} \ge 0$, $\mathbf{c} \ge 0$.

This is a quadratic programming problem. However, this problem can be reformulated
as a Simplex Method problem and can be solved using a variation of
the traditional simplex method called Wolfe's method (Wolfe, 1959).

(b) Minimise the largest absolute error, $c_0$, given by

$$\max(|\epsilon_1|, \ldots, |\epsilon_n|)$$

where $\epsilon_i$, the error in $\mu_i$, is

$$\epsilon_i = a_{i,1} c_1 + \cdots + a_{i,n-m} c_{n-m} - \mu_i, \qquad i = 1, \ldots, n.$$

Our problem now is to minimize $c_0$ subject to $c_0, c_1, \ldots, c_{n-m} \ge 0$. To solve this
problem, we note that we have $n$ pairs of inequality constraints of the form

$$\begin{aligned}
-c_0 + a_{i,1} c_1 + \cdots + a_{i,n-m} c_{n-m} &\le \mu_i \\
c_0 + a_{i,1} c_1 + \cdots + a_{i,n-m} c_{n-m} &\ge \mu_i.
\end{aligned}$$

The addition of slack variables puts the problem in canonical form.

Figure Captions

Fig.  1  Schematic  representation  of  the  directional  attraction  space  around  two  memories 
with  different  eigenvalues  with  Spectral  algorithms. 

Fig. 2 Schematic representation of the directional attraction space in the Dual Spectral
Scheme with $\mu_y \gg \mu_x, \mu_z$.

Fig.  3  Schematic  representation  of  the  joint  memory-specific  and  direction-specific  attrac¬ 
tion  space  for  two  memories  in  the  Composite  Scheme. 

Fig.  4  Average  attraction  radii  around  stable  memories  in  the  Outer-Product  Scheme. 

Fig.  5  Attraction  radii  in  the  Spectral  Scheme. 

Fig. 6 Variation of $\epsilon$, the largest of the unspecified directional parameters, $\mu_{k+1}, \ldots, \mu_n$,
as a ratio of the specified directional parameters with $k$, the number of directions
specified.

Fig.  7  Demonstration  of  direction-specific  attraction  in  the  Dual  Spectral  Scheme. 

Fig. 8 Demonstration of memory-specific and direction-specific attraction in the Composite
Scheme.

On Reliable Computation with Formal Neurons

Santosh S. Venkatesh and Demetri Psaltis


Abstract: We investigate the computing capabilities of formal
McCulloch-Pitts neurons when errors are permitted in decisions.
Specifically, given a random m-set of points $u^1, \ldots, u^m \in \mathbb{R}^n$, a
corresponding m-set of decisions $d^1, \ldots, d^m \in \{-1, 1\}$, and a fractional
error-tolerance $0 \le \epsilon < 1$, we are interested in the following question:
How large can we choose $m$ such that a formal neuron can make
assignments $u^\alpha \mapsto d^\alpha$ with no more than $\epsilon m$ errors? We obtain formal
results for two protocols for error-tolerance: a random error protocol
and an exhaustive error protocol.

In the random error protocol, a random subset of the $m$ points is
randomly and independently specified and the associated decisions labeled
"don't care." We prove that if $m$ is chosen less than $2n/(1 - 2\epsilon)$, then
with high probability, there is a choice of weights for which the expected
number of decision errors made by the neuron is no more than $\epsilon m$; if $m$
is chosen larger than $2n/(1 - 2\epsilon)$, then the probability approaches zero
that there is a choice of synaptic weights for which the expected number
of decision errors made by the neuron is fewer than $\epsilon m$.

In the exhaustive error protocol, the total number of decision errors
has to lie below $\epsilon m$, but we are allowed to choose the set of decisions that
are in error. We show that there is a function $1 \le \kappa_\epsilon < 505$ such that if
$m$ exceeds $2\kappa_\epsilon n/(1 - 2\epsilon)$, then there is, with high probability, no choice
of synaptic weights for which a neuron makes fewer than $\epsilon m$ decision
errors on the m-set of inputs. For small $\epsilon$, the function $\kappa_\epsilon$ is close to 1
so that, informally, we can specify m-sets as large as $2n/(1 - 2\epsilon)$ (but
not larger) and obtain reliable decisions within the prescribed tolerance
for some suitable choice of weights.

Index Terms: Capacity, computation, fault-tolerance, formal neurons,
large deviations, reliability.


I. Introduction

The formal modeling of biological neurons as linear threshold gates
dates to the seminal paper of McCulloch and Pitts [3]. Although the
biological plausibility of these models is open to debate, extensive
investigations since the work of McCulloch and Pitts have shown that
considerable computational power is latent in networks of formal
neurons.

In  its  simplest  form,  a  formal  neuron  is  a  computational  device 
that  accepts  n  real  inputs  and  produces  a  single  bit  output  depending 
on  whether  a  weighted  sum  of  the  inputs  exceeds  a  fixed  threshold. 
If  the  inputs  are  constrained  to  be  Boolean  variables,  then  the  neuron 
simply  realizes  a  Boolean  function  of  n  variables. 

A fundamental counting result (cf. Schläfli [4], Wendel [5], and
Cover [6]) helps quantify the computational capability of a neuron:
for any $m$ set in Euclidean $n$ space, the result gives a precise count
of the number of dichotomies of the set that can be separated by a
neuron. Each dichotomy is a collection of $m$ decisions made by the
neuron on the $m$ set. Schläfli's theorem, therefore, gives the number
of distinct sets of decisions that a neuron can make on a given set
of data. The theorem can be used to estimate the maximum number
of decisions that can be reliably made by a neuron; this number is
linear in $n$, as we will see in the sequel.

In the considerations above, we have tacitly assumed that functions
are computed without errors. Although this is the norm in most logical
functions implemented on digital computers, computations involving
cognitive tasks such as pattern recognition are frequently
less exact. It is hence reasonable to wonder whether allowing a formal
neuron the latitude to make errors can substantially increase the set
of problems that it can handle.¹

Assume that a set of $m$ decisions are to be made on a randomly
specified $m$ set of points in $n$ space and that we allow an error
tolerance of $\epsilon m$ decision errors, with $0 \le \epsilon < 1/2$. We are
interested in how large we can choose $m$ such that the neuron
makes reliable decisions within the prescribed error tolerance. A
superficial consideration of the problem might indicate that substantial
gains in computation may be achievable if errors are permitted,
as the following analysis indicates. The number of ways that $\epsilon m$
errors can occur in decisions is $\binom{m}{\epsilon m}$, and corresponding to each
such specification of the $\epsilon m$ incorrect decisions, there is a set of
$(1 - \epsilon)m$ decisions that are required to be correct. Hence, the neuron
is only required to realize any one of $\binom{m}{\epsilon m}$ distinct sets of $(1 - \epsilon)m$
decisions reliably. For large $m$, $\binom{m}{\epsilon m} = \Omega(2^{cm})$ for a positive constant
$c$ so that there is an exponential number of distinct choices of
which the neuron has to implement only one. It hence appears that
there may potentially be substantive computational gains to be made
if we allow some error tolerance in the decisions. Our main result in
this paper, however, indicates that such gains are not actually realized,
and the maximum number of decisions that can be made by a neuron
under such circumstances remains linear in $n$; specifically, we prove
the following results.
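The exponential growth of $\binom{m}{\epsilon m}$ can be made concrete with a short computation. The sketch below is an illustration under stated assumptions: it compares the per-decision exponent $\frac{1}{m}\log_2\binom{m}{\lfloor \epsilon m \rfloor}$ with the binary entropy $H(\epsilon)$, using the standard bounds $2^{mH(\epsilon)}/(m+1) \le \binom{m}{\epsilon m} \le 2^{mH(\epsilon)}$. The function names are hypothetical.

```python
import math

def binary_entropy(x):
    """H(x) = -x log2 x - (1-x) log2 (1-x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def log2_binomial_rate(m, eps):
    """(1/m) * log2 C(m, floor(eps*m)): the exponent c in C(m, eps*m) ~ 2^(c*m)."""
    k = int(eps * m)
    return math.log2(math.comb(m, k)) / m
```

For fixed `eps`, `log2_binomial_rate(m, eps)` approaches `binary_entropy(eps)` from below as `m` grows, which is one way to read the positive constant $c$ in the text.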

• The sequence $2n/(1 - 2\epsilon)$ is a threshold function² for the
property that there is a choice of synaptic weights for which
the neuron makes no more than (essentially) $\epsilon m$ random errors
in decisions. In particular, if $m$ is less than $2n/(1 - 2\epsilon)$, then
with high probability, there is a choice of weights for the neuron
such that the expected number of errors is fewer than $\epsilon m$; if $m$
exceeds $2n/(1 - 2\epsilon)$, then the probability approaches zero that
there is a choice of weights for the neuron such that the expected
number of errors is fewer than $\epsilon m$.

• There is a function $1 \le \kappa_\epsilon < 505$ such that if $m$ exceeds
$2\kappa_\epsilon n/(1 - 2\epsilon)$, then there is (asymptotically) no choice of
synaptic weights for which a neuron makes fewer than $\epsilon m$
decision errors on the $m$ set of inputs.

In the next section, we develop some notation and introduce the
notions of $\epsilon$ reliability and capacity function. The main theorems
are stated and proved in Section III; two technical results from large
deviation probability theory are confined to the Appendix.

¹ Note that we assume that we have complete control over the neuron
parameters and that errors creep into the neural output because we are
overloading the capacity of the neuron. The notion of reliability of the
decisions here is somewhat different from the case where decisions are
unreliable because of a lack of control in the specification of the neuron
(such as noise in the weights).

² The terminology threshold function, although standard in the probabilistic
method, is a trifle unfortunate in the present context of linear threshold
elements. We will replace it by the term capacity function in the next section.

Several of the arguments used involve asymptotics, and we briefly
sketch our use of notation here. If $\{x_n\}$ and $\{y_n\}$ are any two positive
sequences, we say that

(a) $x_n = \Omega(y_n)$ if $x_n/y_n$ is bounded from below

(b) $x_n = O(y_n)$ if $x_n/y_n$ is bounded from above

(c) $x_n \sim y_n$ if $x_n/y_n$ approaches 1 as $n \to \infty$

(d) $x_n = o(y_n)$ if $x_n/y_n$ approaches zero as $n \to \infty$.

If $A_n$ is a sequence of events in a probability space, we say that $A_n$
occurs with probability one if $P\{A_n\} \to 1$ as $n \to \infty$. By $\Phi$, we
denote the Gaussian distribution function

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy.$$

By $b(k; N, p)$, we denote the probability that $N$ Bernoulli trials with
probabilities $p$ for success and $(1 - p)$ for failure result in $k$ successes
and $N - k$ failures, i.e.,

$$b(k; N, p) = \binom{N}{k} p^k (1 - p)^{N-k}.$$

We also denote by $\mathbb{B}$ the set $\{-1, 1\}$.


II.  Error  Tolerance 


A formal neuron is a linear threshold gate characterized by a vector
of $n$ real weights $w = (w_1, \ldots, w_n)$ and a real threshold $w_0$. The
neuron accepts as inputs points $u \in \mathbb{R}^n$ and produces as output a
single bit $d \in \mathbb{B}$, according to the following rule:

$$d = \begin{cases} +1 & \text{if } \sum_{j=1}^{n} w_j u_j \ge w_0 \\ -1 & \text{if } \sum_{j=1}^{n} w_j u_j < w_0. \end{cases}$$

We will without loss of generality assume that the threshold $w_0 = 0$.³
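The decision rule is short enough to state in code. The following minimal sketch uses hypothetical helper names. It implements the threshold rule as written and also shows one standard way a nonzero threshold can be folded into a zero-threshold neuron, namely by appending a constant input of $-1$ whose weight is $w_0$.

```python
def neuron(w, u, w0=0.0):
    """Linear threshold gate: +1 if sum_j w_j u_j >= w0, else -1."""
    s = sum(wj * uj for wj, uj in zip(w, u))
    return 1 if s >= w0 else -1

def absorb_threshold(w, w0):
    """Weights of an equivalent zero-threshold neuron on n + 1 inputs."""
    return list(w) + [w0]

def augment_input(u):
    """Append the constant input -1 that carries the threshold."""
    return list(u) + [-1]
```

With these helpers, `neuron(w, u, w0)` and `neuron(absorb_threshold(w, w0), augment_input(u))` agree for exactly representable inputs, since $\sum_j w_j u_j - w_0 \ge 0$ is the same condition as $\sum_j w_j u_j \ge w_0$.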

The neuron hence associates a decision or classification
($-1$ or $+1$) with each point in $n$ space. For any given set of
points then, the neuron dichotomizes the set of points into two
classes: those points mapped to $+1$ and those mapped to $-1$. In
the geometric analog the neuron represents a separating hyperplane
in $n$ space. We are interested in characterizing the largest set of points
for which there exist choices of weights such that almost any set of
decisions associated with the specified set of points is realizable.

Let $u^1, \ldots, u^m \in \mathbb{R}^n$ be a random $m$ set of points chosen from
any joint distribution invariant to reflection of components around the
origin and such that every subset of $n$ points is linearly independent
with probability one. Let $d^1, \ldots, d^m \in \mathbb{B}$ be a corresponding $m$ set
of decisions. For a neuron to make reliable decisions, we must find
a choice of weights that realize the assignments $u^\alpha \mapsto d^\alpha$ for each
$\alpha = 1, \ldots, m$. Note that we can take all the decisions $d^\alpha$ to be $+1$
without any loss of generality because the vectors $u^\alpha$ can always be
reflected about the origin.

We incorporate error tolerance in decisions by introducing the
notion of "don't-care decisions." Let $0 \le \epsilon < 1/2$ be the fraction of
errors that we are willing to allow in the decisions (i.e., we tolerate
$\epsilon m$ errors in decisions). Let $D^\alpha$, $\alpha = 1, \ldots, m$ be the outcomes of $m$
identical and independent experiments whose outcomes are subsets
of $\{-1, 1\}$, and such that

$$D^\alpha = \begin{cases} \{d^\alpha\} & \text{with probability } 1 - 2\epsilon \\ \{-1, 1\} & \text{with probability } 2\epsilon. \end{cases}$$

³ In fact, it is easy to see that the threshold can be accommodated by
allowing an additional constant input of $-1$ to a zero threshold neuron. Our
results will then continue to hold for nonzero thresholds by replacing $n$ by
$n + 1$.


If a sample outcome $D^\alpha = \{d^\alpha\}$, then we require that the neuron
produces the specified decision $d^\alpha$ as output whenever it receives $u^\alpha$
as input. If, however, the sample outcome $D^\alpha = \mathbb{B}$, then we
associate a don't-care decision with point $u^\alpha$: the neuron can result
in either $-1$ or $1$ as output when $u^\alpha$ is input. We call $D^\alpha$ the
decision set associated with decision $d^\alpha$; we say that $D^\alpha$ is normal
if $D^\alpha = \{d^\alpha\}$ (i.e., the decision has to be accurate), and $D^\alpha$ is
exceptional if $D^\alpha = \mathbb{B}$ (i.e., the decision is don't-care). The idea
behind defining the decision sets in this fashion is the following. With
the $m$ decision sets $D^1, \ldots, D^m$ generated independently according
to the above prescription, the expected number of normal decision sets
is $(1 - 2\epsilon)m$, whereas the expected number of exceptional decision
sets is $2\epsilon m$. We now forget about those points corresponding to the
exceptional decision sets and attempt to find neural weights that will
correctly classify the remaining points corresponding to the normal
decision sets. If we can successfully do this, then, beside the points
corresponding to the normal decision sets, on average, one half of
the points corresponding to the exceptional (don't-care) decision
sets will serendipitously also turn out to be correctly classified.
(Because the weights were chosen without taking the exceptional
points into consideration, one half of them, on average, will be
correctly classified as the points are chosen from a distribution that is
invariant to reflections about the origin.) Thus, the expected number
of errors in decision will only be $\epsilon m$. We formalize this notion of a
random error protocol in the following:
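The bookkeeping behind the random error protocol can be simulated directly. The sketch below is illustrative and rests on two labeled assumptions: every normal decision is taken to be realised exactly, and the neuron's output on each don't-care point is modelled as an independent fair coin (a stand-in for the reflection-invariance argument, not a simulation of an actual weight vector). The function names are hypothetical.

```python
import random

def sample_decision_sets(decisions, eps, rng):
    """Each D is {d} with probability 1 - 2*eps and {-1, +1} with probability 2*eps."""
    return [frozenset((-1, 1)) if rng.random() < 2 * eps else frozenset((d,))
            for d in decisions]

def simulate_error_fraction(m, eps, seed=0):
    """Return (fraction of exceptional sets, fraction of decision errors).

    Assumption: outputs on don't-care points are independent fair coins,
    and all normal decisions are made correctly.
    """
    rng = random.Random(seed)
    decisions = [rng.choice((-1, 1)) for _ in range(m)]
    sets = sample_decision_sets(decisions, eps, rng)
    exceptional = errors = 0
    for d, D in zip(decisions, sets):
        if len(D) == 2:                      # exceptional (don't-care) set
            exceptional += 1
            if rng.choice((-1, 1)) != d:     # coin-flip output disagrees with d
                errors += 1
    return exceptional / m, errors / m
```

For large `m` the two returned fractions should sit near $2\epsilon$ and $\epsilon$ respectively, matching the expected counts $2\epsilon m$ and $\epsilon m$ in the text.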

Definition 2.1: Let $w_1, \ldots, w_n \in \mathbb{R}$ be the weights corresponding
to a neuron. We say that the neuron makes $\epsilon$-reliable decisions on
$u^1, \ldots, u^m$ if

$$\operatorname{sgn}\Bigl(\sum_{j=1}^{n} w_j u_j^\alpha\Bigr) \in D^\alpha, \qquad \alpha = 1, \ldots, m.$$

Note that by the Borel strong law, the fraction of don't-care
decisions is almost surely $2\epsilon$. Further, because the vectors $u^\alpha$ are
invariant to reflections about the origin, the fraction of actual decision
errors that occur for a neuron making $\epsilon$-reliable decisions is almost
surely $\epsilon$. The case $\epsilon = 0$ reverts to the case of perfect decisions.

In the random error protocol, we are interested in the following
attribute of the $m$ set of points $u^1, \ldots, u^m$ and the corresponding
decisions:

EVENT $\mathcal{F}_\epsilon(n, m)$: There is a choice of weights such that the
neuron makes $\epsilon$-reliable decisions.


The attribute $\mathcal{F}_\epsilon(n, m)$ deals with the notion of reliable decisions
on a random subset of points of expected size $(1 - 2\epsilon)m$.⁴ The
average number of errors allowed within this protocol is $\epsilon m$, but
it is conceivable, albeit a rare occurrence, that the actual number
of errors substantially exceeds $\epsilon m$. In the exhaustive error protocol,
however, it is not permitted that the number of errors exceeds $\epsilon m$
substantially. There is no constraint, however, on the choice of which
$(1 - \epsilon)m$ decisions are to be correctly implemented, and we are
free to exhaustively check each alternative of making $\epsilon m$ errors and
choose the most favorable one. This protocol leads to a consideration
of the following attribute of the $m$ set of points $u^1, \ldots, u^m$ and the
corresponding decisions.

⁴ The method of choosing decision sets advocated in this paper is not
sacrosanct, and we could utilize any random strategy for choosing decision
sets that yields an expected number of $(1 - 2\epsilon)m$ normal decision sets.
We could, for instance, choose the random subset of normal decision sets
from the uniform distribution on all subsets of size $(1 - 2\epsilon)m$ from the set
of $m$ decisions; alternatively, we could replace the independent assignment
of don't-cares in this paper by a Markovian strategy. The choice of the
binomial distribution for specifying don't-cares in this paper was motivated
in part because it is, in a sense, natural (the independent assignment of
don't-cares from decision to decision avoids any bias due to prior don't-care
assignments) and the fact that the computational complexity of choosing
decision sets reduces to an exercise in coin flipping. The results of the next
section hold for most reasonable choices of underlying distribution resulting
in an expected number of $(1 - 2\epsilon)m$ normal decision sets, although the
technical details in the proofs can alter slightly.

EVENT $\mathcal{G}_\epsilon(n, m)$: There is a choice of weights such that the
neuron makes no more than $\epsilon m[1 + o(1)]$ decision errors.

The attribute $\mathcal{G}_\epsilon(n, m)$ is somewhat stronger than the attribute
$\mathcal{F}_\epsilon(n, m)$; instead of attempting to realize a random subset of
decisions, we query whether it is possible to find a choice of weights
for the neuron such that at least one of the $\binom{m}{\epsilon m}$ subsets of $(1 - \epsilon)m$
decisions is reliably made.

Definition 2.2: Let $\mathcal{A}(n, m)$ be an attribute of the $m$ set of points
$u^1, \ldots, u^m$. A sequence $\{C(n)\}_{n=1}^{\infty}$ is a capacity function for the
attribute $\mathcal{A}(n, m)$ if for $\lambda > 0$ arbitrarily small: 1) $P\{\mathcal{A}(n, m)\} \to 1$
as $n \to \infty$ whenever $m \le (1 - \lambda)C(n)$, and 2) $P\{\mathcal{A}(n, m)\} \to 0$
as $n \to \infty$ whenever $m \ge (1 + \lambda)C(n)$.

We say that $C(n)$ is a lower capacity function if it satisfies the first
condition and that $C(n)$ is an upper capacity function if it satisfies
the second condition.

The  term  threshold  function  is  more  standard  in  the  literature  of 
the  probabilistic  method  when  an  attribute  exhibits  such  a  threshold 
behavior.  Our  definition  is  slightly  stronger  than  is  usual. 

Capacity functions have been found for a variety of neural network
architectures and algorithms [12], [7]-[13]. These investigations
into network capacity have hitherto concentrated mainly on capacity
functions for perfect decisions with no errors (cf. [9], [10], however,
for results on error tolerance in the outer-product algorithm). In the
following, we expand on the results in [12] and show capacity
functions for the attributes $\mathcal{F}_\epsilon(n, m)$ and $\mathcal{G}_\epsilon(n, m)$.

III. Capacity Functions
A. Error-Free Decisions

We begin by investigating the attribute $\mathcal{F}_0(n, m)$ that there is
a neuron that makes reliable decisions $u^\alpha \mapsto d^\alpha$ for each
$\alpha = 1, \ldots, m$. The following fundamental result is due to Schläfli [4]
(Wendel [5] has a more accessible proof of the result; cf. also Cover [6]).

Lemma 3.1: The probability that there is a neuron that makes
reliable decisions on a random $m$ set of points in Euclidean $n$ space
$\mathbb{R}^n$ is given by

$$P\{\mathcal{F}_0(n, m)\} = \sum_{k=0}^{n-1} b(k;\, m - 1,\, 0.5). \tag{1}$$
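Equation (1) can be evaluated exactly with integer arithmetic. The following is a small sketch (the function name is an assumption) that returns the probability as an exact fraction. A quick consequence of the symmetry of the binomial coefficients is that the value is exactly $1/2$ at $m = 2n$, and it equals 1 whenever $m \le n$.

```python
from fractions import Fraction
from math import comb

def prob_separable(n, m):
    """P{F_0(n, m)} = sum_{k=0}^{n-1} b(k; m-1, 1/2), as an exact Fraction."""
    if m < 1:
        raise ValueError("m must be at least 1")
    total = sum(comb(m - 1, k) for k in range(n))  # comb is 0 once k > m - 1
    return Fraction(total, 2 ** (m - 1))
```

Tabulating `prob_separable(n, m)` for fixed `n` and increasing `m` shows the transition from values near 1 to values near 0 around $m = 2n$.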

An application of Lemma A.2 directly yields the following:

Theorem 3.2: The sequence $C_0(n) = 2n$ is a capacity function
for the attribute $\mathcal{F}_0(n, m)$.⁵

Proof: Fix $0 < \lambda < 1$ and let $H(x)$, $0 \le x \le 1$ denote the
binary entropy function defined in Lemma A.2. With a choice of
$m = \lfloor 2n(1 - \lambda) \rfloor$ in (1), we can find $1/2 < c < 1$ such that

$$P\{\mathcal{F}_0(n, m)\} \ge \sum_{k=0}^{c(m-1)} b(k;\, m - 1,\, 0.5)
 \ge 1 - 2^{-(1 - H(c))(m-1)} \to 1, \qquad n \to \infty$$

with the second inequality following from Lemma A.2. Similarly, for
a choice of $m = \lfloor 2n(1 + \lambda) \rfloor$, we can find $0 < a < 1/2$ such that

$$P\{\mathcal{F}_0(n, m)\} \le \sum_{k=0}^{a(m-1)} b(k;\, m - 1,\, 0.5)
 \le 2^{-(1 - H(a))(m-1)} \to 0, \qquad n \to \infty$$

where we have used the dual form of Lemma A.2 (cf. remarks
following the lemma). ■

⁵ cf. also a recent result due to Füredi [16] on random polytopes in the cube.

B.  Epsilon  Reliability 

We now investigate the attribute $\mathcal{F}_\epsilon(n, m)$ that there is a choice
of weights for which a neuron makes $\epsilon$-reliable decisions on the $m$
set of points $u^1, \ldots, u^m$.

Theorem 3.3: The sequence $C_\epsilon(n) = 2n/(1 - 2\epsilon)$ is a capacity
function for the attribute $\mathcal{F}_\epsilon(n, m)$.
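Proof: Let $0 \le \epsilon < 1/2$ be the given tolerance. Noting that the
decision sets are generated independently of the $m$ set of points, a
direct application of Lemma 3.1 yields

$$P\{\mathcal{F}_\epsilon(n, m)\} = \sum_{k=0}^{m} b(k;\, m,\, 1 - 2\epsilon)\, P\{\mathcal{F}_0(n, k)\}. \tag{2}$$

We first claim that $P\{\mathcal{F}_0(n, k)\}$ is a monotone nonincreasing
function of $k$ for each positive integer $n$. To show this, consider
the difference $P\{\mathcal{F}_0(n, k)\} - P\{\mathcal{F}_0(n, k + 1)\}$ for any choice of $k$
and $n$. Using (1) and elementary binomial identities, we have

$$\begin{aligned}
P\{\mathcal{F}_0(n, k)\} - P\{\mathcal{F}_0(n, k + 1)\}
 &= 2^{-k}\Bigl[\,2\sum_{j=0}^{n-1}\binom{k-1}{j} - \sum_{j=0}^{n-1}\binom{k}{j}\Bigr] \\
 &= 2^{-k}\Bigl[\,\sum_{j=0}^{n-1}\binom{k-1}{j} - \sum_{j=0}^{n-2}\binom{k-1}{j}\Bigr] \\
 &= 2^{-k}\binom{k-1}{n-1}.
\end{aligned}$$

Hence, $P\{\mathcal{F}_0(n, k)\} - P\{\mathcal{F}_0(n, k + 1)\} \ge 0$ for any choice
of $k$ and $n$, and the claim is proved. Fix parameters $\lambda > 0$ and
$1/2 < \nu < 2/3$. Now choose $m = 2n(1 - \lambda)/(1 - 2\epsilon)$ and set

$$t_m = m(1 - 2\epsilon) + m^{\nu}\sqrt{2\epsilon(1 - 2\epsilon)}.$$

Using the monotonicity of $P\{\mathcal{F}_0(n, k)\}$ and (2), we obtain

$$P\{\mathcal{F}_\epsilon(n, m)\} \ge
 \underbrace{P\{\mathcal{F}_0(n, t_m)\}}_{A}\;
 \underbrace{\sum_{k=0}^{t_m} b(k;\, m,\, 1 - 2\epsilon)}_{B}.$$

As $n \to \infty$, we then have from Theorem 3.2 that

$$A = P\{\mathcal{F}_0(n, \lfloor 2n(1 - \lambda)(1 + o(1)) \rfloor)\} \to 1$$

while an application of Lemma A.1, and the choice $1/2 < \nu < 2/3$,
yields

$$B = 1 - O\bigl(e^{-c_1 m^{2\nu - 1}}\bigr)$$

for some positive constant $c_1$. Hence, for every $\lambda > 0$

$$P\{\mathcal{F}_\epsilon(n, \lfloor 2n(1 - \lambda)/(1 - 2\epsilon) \rfloor)\} \to 1, \qquad \text{as } n \to \infty.$$

Now choose $m = 2n(1 + \lambda)/(1 - 2\epsilon)$, and set

$$x_m = m(1 - 2\epsilon) - m^{\nu}\sqrt{2\epsilon(1 - 2\epsilon)}.$$

Again, using the monotonicity of $P\{\mathcal{F}_0(n, k)\}$ in (2), we have

$$P\{\mathcal{F}_\epsilon(n, m)\} \le
 \underbrace{\sum_{k=0}^{x_m} b(k;\, m,\, 1 - 2\epsilon)}_{C}
 + \underbrace{P\{\mathcal{F}_0(n, x_m)\}}_{D}.$$

An application of Lemma A.1 and the choice $1/2 < \nu < 2/3$ yields
that as $n \to \infty$

$$C = O\bigl(e^{-c_1 m^{2\nu - 1}}\bigr)$$

for some positive constant $c_1$. In addition, Theorem 3.2 yields

$$D = P\{\mathcal{F}_0(n, \lfloor 2n(1 + \lambda)(1 - o(1)) \rfloor)\} \to 0.$$

It hence follows that for every choice of $\lambda > 0$

$$P\{\mathcal{F}_\epsilon(n, \lfloor 2n(1 + \lambda)/(1 - 2\epsilon) \rfloor)\} \to 0, \qquad \text{as } n \to \infty.$$

Thus, the sequence $2n/(1 - 2\epsilon)$ is both a lower and an upper capacity
function (hence, a capacity function) for the attribute $\mathcal{F}_\epsilon(n, m)$. ■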
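For moderate $n$ the mixture in equation (2) can be evaluated numerically, which makes the threshold at $2n/(1 - 2\epsilon)$ visible. The sketch below is illustrative: the function names are hypothetical, floating-point arithmetic is used for the binomial weights, and $P\{\mathcal{F}_0(n, 0)\}$ is taken to be 1 (no points to classify), an assumption the text does not spell out.

```python
from math import comb

def prob_error_free(n, k):
    """Lemma 3.1: P{F_0(n, k)}; taken to be 1 when there are no points (k = 0)."""
    if k == 0:
        return 1.0
    return sum(comb(k - 1, j) for j in range(n)) / 2 ** (k - 1)

def prob_eps_reliable(n, m, eps):
    """Equation (2): average of P{F_0(n, K)} over K ~ Binomial(m, 1 - 2*eps)."""
    p = 1.0 - 2.0 * eps
    total = 0.0
    for k in range(m + 1):
        weight = comb(m, k) * p ** k * (1.0 - p) ** (m - k)  # b(k; m, 1 - 2*eps)
        total += weight * prob_error_free(n, k)
    return total
```

With `n = 50` and `eps = 0.1` the capacity function evaluates to 125, and `prob_eps_reliable(50, m, 0.1)` can be tabulated over `m` to see the probability fall from near 1 to near 0 around that value.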

Let us now consider attribute $\mathcal{G}_\epsilon(n, m)$. The Borel strong law of
large numbers yields that the fraction of exceptional decision sets
is no more than $2\epsilon[1 + o(1)]$ with probability one. For any choice
of weights realizing $\epsilon$-reliable decisions on the $m$ set of points
$u^1, \ldots, u^m$, it follows that the fraction of decision errors will be
no more than $\epsilon[1 + o(1)]$ with probability one as the $m$ set of points
is chosen from a joint distribution invariant to reflections. The proof
of the above theorem then directly yields

Theorem 3.4: The sequence $2n/(1 - 2\epsilon)$ is a lower
capacity function for the attribute $\mathcal{G}_\epsilon(n, m)$.

Now, consider all possible ways of assigning at most $\epsilon m[1 + o(1)]$
errors in decision to the $m$ set of points. Attribute $\mathcal{G}_\epsilon(n, m)$ is
realized if at least one of these possibilities is realizable for some
appropriate choice of weights. (This clearly relaxes the more stringent
requirements of attribute $\mathcal{F}_\epsilon(n, m)$, where we require that a randomly
chosen one of these possibilities is realizable for an appropriate choice
of weights.) Because the number of such possibilities is exponential,
we might hope to realize significant gains in capacity for attribute
$\mathcal{G}_\epsilon(n, m)$. The following result shows that we can hope to gain by,
at best, a constant scale factor but that the linear rate of growth of
the capacity function is unaffected.

Theorem 3.5: Let $\kappa_\epsilon$ be a function of the error tolerance $\epsilon$ defined
by the unique solution of

$$H\Bigl(\frac{1 - 2\epsilon}{2\kappa_\epsilon}\Bigr) + H(\epsilon) = 1, \qquad 0 \le \epsilon < 1/2 \tag{3}$$

where $H$ is the binary entropy function. Then, the sequence
$2\kappa_\epsilon n/(1 - 2\epsilon)$ is an upper capacity function for the attribute
$\mathcal{G}_\epsilon(n, m)$.

Remarks: The function $\kappa_\epsilon$ is defined as we vary $\epsilon$ in the interval
$0 \le \epsilon < 1/2$ and monotonically increases from a value of $+1$ at
$\epsilon = 0$ to a value close to 505 as $\epsilon$ approaches 1/2. For small $\epsilon$, it
remains close to $+1$ so that the capacity function for the attribute
$\mathcal{G}_\epsilon(n, m)$ still behaves like $2n/(1 - 2\epsilon)$.
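Equation (3) has no closed-form solution, but $\kappa_\epsilon$ is easy to compute numerically. The following is a minimal bisection sketch under stated assumptions: it takes (3) in the form $H((1 - 2\epsilon)/2\kappa) + H(\epsilon) = 1$, uses the fact that the left side is decreasing in $\kappa$ for $\kappa \ge 1$, and the function names and tolerance are illustrative choices.

```python
import math

def binary_entropy(x):
    """H(x) = -x log2 x - (1-x) log2 (1-x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def kappa(eps, tol=1e-12):
    """Solve H((1 - 2*eps) / (2*kappa)) + H(eps) = 1 for kappa >= 1 by bisection."""
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must lie in [0, 1/2)")

    def f(k):
        return binary_entropy((1.0 - 2.0 * eps) / (2.0 * k)) + binary_entropy(eps) - 1.0

    lo, hi = 1.0, 2.0
    while f(hi) > 0.0:      # f decreases in kappa; widen until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Evaluating `kappa` at a few small tolerances, for instance `kappa(0.001)` and `kappa(0.01)`, gives a feel for how the scale factor in the upper capacity function moves away from 1 as the tolerance grows.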

Proof:  Let  us  consider  a  tolerance  of  exactly  cm  errors. 
The  proof  is  not  materially  altered  if  the  allowed  tolerance  is 
rT7i(l  +  o(l)]. 

Let  P(n,  m)  denote  the  probability  that  there  is  a  choice  of  weights 
for  which  a  given  set  of  j  decisions  is  made  incorrectly,  whereas  the 
remaining  m  —  j  decisions  are  made  correctly.  By  Lemma  3.1,*  we 
get  that  for  any  choice  of  j 

n  — I 

P(n.m)  =  ^  6(*;  m  -  1,0.5).  (4) 

Now,  the  number  of  ways  in  which  tm  or  fewer  decision  errors  can 
be  made  is  (  7  )  ’  **>6  union  bound  gives 

<m 

P{c;,(f»,m)}  <  P(n,m)^ 

j=o 


^Fixing  a  set  of  .7  points  «i°> ,  -  which  are  incorrectly  classified,  is 

equivalent  to  specifying  the  corresponding  decisions  to  be  -d"* .  -  - ,  — ePr ; 
this,  however,  just  yields  a  different  dichotomy  of  m  points  in  n  space,  and 
Schlafli's  lemma  applies. 


Let $\lambda > 0$ be fixed but arbitrary, and choose $m = 2\kappa_\epsilon n(1 + \lambda)/(1 - 2\epsilon)$, where $\kappa_\epsilon$ is as defined in (3). Applying Lemma A.2 to (4), we obtain that

P(n, m) <

while another application of Lemma A.2 gives

$\sum_{j=0}^{\epsilon m} \binom{m}{j}$ <

for large enough $m$. Hence, for each choice of $0 < \epsilon < 1/2$ and $\lambda > 0$, there is a choice of $\delta > 0$ such that

$P\{\mathcal{E}_\epsilon(n, m)\}$ <

The binary entropy function $H(\epsilon)$ increases monotonically from a value of 0 at $\epsilon = 0$ to a value of 1 at $\epsilon = 1/2$. Hence, with $\kappa_\epsilon$ as in (3), $H\{(1 - 2\epsilon)/2\kappa_\epsilon(1 + \lambda)\} + H(\epsilon) < 1$. Hence, for every choice of $\lambda > 0$ and $m = 2\kappa_\epsilon n(1 + \lambda)/(1 - 2\epsilon)$, there is a choice of $\delta > 0$ such that $P\{\mathcal{E}_\epsilon(n, m)\} < 2^{-\delta n} \to 0$ as $n \to \infty$. ∎
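For small problem sizes the right-hand side of the union bound can be evaluated exactly, which gives a feel for how quickly it collapses once $m$ is a sufficiently large multiple of $n$. The sketch below is illustrative only: the function names are ours, exact rational arithmetic is used to avoid underflow, and the tolerance is taken to be $\lfloor \epsilon m \rfloor$ errors.

```python
from fractions import Fraction
from math import comb


def separation_probability(n: int, m: int) -> Fraction:
    """P(n, m) = sum_{k=0}^{n-1} b(k; m-1, 1/2), as in equation (4)."""
    total = sum(comb(m - 1, k) for k in range(n))
    return Fraction(total, 2 ** (m - 1))


def error_patterns(m: int, eps: float) -> int:
    """Number of ways to choose floor(eps * m) or fewer erroneous decisions."""
    return sum(comb(m, j) for j in range(int(eps * m) + 1))


def union_bound(n: int, m: int, eps: float) -> Fraction:
    """P(n, m) * sum_j C(m, j); a value of 1 or more is a vacuous bound."""
    return separation_probability(n, m) * error_patterns(m, eps)
```

For instance, `union_bound(50, 50, 0.1)` is at least 1 and so carries no information, whereas `union_bound(50, 830, 0.1)` is already astronomically small.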

IV. Conclusion

The  results  proved  in  this  paper  demonstrate  that  a  formal  neuron 
has  a  computational  capacity  that  is  linear  in  n  and  that  this  rate  of 
growth of capacity persists even when errors are tolerated in the decisions. A question that arises at this juncture is how this result bears on
computations  involving  networks  of  formal  neurons.  In  particular,  for 
an  associative  memory  model  composed  of  n  densely  interconnected 
formal  neurons,  the  rigorous  determination  of  the  maximal  storage 
capacity  when  errors  are  permitted  in  recall  is  an  open  question.  We 
analyze  this  in  a  subsequent  paper  [17]  (cf.  also  [2]). 


Appendix
Large Deviations

We quote two technical lemmas from large deviation probability theory that we will need. Both results concern probabilities in the tails of the binomial distribution. Lemma A.1 provides a good uniform estimate for the cumulative distribution of a sum of $N$ independent (0, 1) random variables valid for deviations from the mean as large as $o(N^{2/3})$ (instead of the $O(\sqrt{N})$ deviations encountered in the usual central limit theorem). The approximation is the strong form of the large deviation central limit theorem [14]. Lemma A.2 is due to Chernoff [15] and estimates probabilities in the extreme tails (deviations of the order of $N$ from the mean) of the binomial distribution.

Lemma  A.1:  Let  0  <  p  <  1,  and  let  (i'n)  be  a  sequence  such 
that  jr.v  -  A>|  <  A'(A)  =  Then 


£6(k;A’,p) 


.4 


N 


If,  in  addition,  vs  -  A'p  =  n(A*'),  for  some  1/2  <v<  2/3,  then 


26(ik;Ai,p)  =  l-0(e-‘'"*‘'") 

t=o 

with  ^  >  0.  which  is  a  constant. 

Lemma A.2: Let $0 < p < 1$ be fixed, and let $T_p$ and $H$ be real-valued functions on the closed interval $[0, 1]$ defined for $0 \le c \le 1$ by

$$T_p(c) = -c \log_2 p - (1 - c) \log_2 (1 - p),$$
$$H(c) = -c \log_2 c - (1 - c) \log_2 (1 - c).^{*}$$

*We define $H(c) = 0$ when $c = 0$ or $c = 1$.

Then, for every choice of $c \in (p, 1)$ we have

$$\sum_{k=0}^{\lfloor cN \rfloor} b(k; N, p) \ge 1 - 2^{-N[T_p(c) - H(c)]}.$$

Remarks: The quantity $H$ is the binary entropy function. Note that $T_p(c) \ge H(c)$ for all $c$ (with equality only for $c = p$) so that Chernoff's bound yields an exponentially small probability for the extreme tails. The bound is, in fact, exponentially tight.

For the special case $p = 1/2$, Chernoff's bound yields

$$\sum_{k=0}^{\lfloor cN \rfloor} b(k; N, 0.5) \ge 1 - 2^{-N[1 - H(c)]}$$

for any choice of $1/2 < c < 1$. Note also that by the symmetry of the binomial distribution, we have

$$\sum_{k=0}^{\lfloor aN \rfloor} b(k; N, 0.5) \le 2^{-N[1 - H(a)]}$$

for any choice of $0 < a < 1/2$.

Abstract

How does an allowed tolerance for error in output affect the computational capability of a neural network and the ability of the network to learn an underlying problem structure? The subject of this communication is the development of formal protocols for handling error tolerance which allow a precise determination of the computational gains that may be expected. The error protocols are illustrated in the framework of a densely interconnected neural network architecture (with associative memory the putative application), and rigorous calculations of capacity are shown. Explicit capacities are also derived for the case of feedforward neural network configurations.


*Presented in part at the Workshop on Neural Networks for Computing, Snowbird, Utah, 1986 [1].

1 INTRODUCTION

How  does  an  allowed  tolerance  for  error  in  output  affect  the  computational  capability  of  a  neural  network 
and  the  ability  of  the  network  to  learn  an  underlying  problem  structure?  In  this  paper  we  attempt  to 
mathematically  codify  the  computational  gains  that  are  realised  when  errors  are  permitted  in  the  network 
output.  Such  error  tolerance  may  occur  naturally  in  applications  such  as  associative  memory  or  pattern 
classification  where  we  do  not  insist  on  accurately  classifying  every  feature;  alternatively,  error  tolerance 
may  be  forced  upon  us*  by  the  inability  of  the  neural  architecture  under  consideration  to  respond  accurately 
to  every  instance  of  a  specific  problem:  as  in,  for  example,  attempting  to  classify  two  non-linearly  separable 
classes  of  points  with  a  single  separating  plane  (or  equivalently,  a  single  McCulloch-Pitts  neuron).^ 

Anecdotal evidence exists for the premise that an allowed error tolerance can have a significant effect on computational capability. Consider, for instance, an associative memory application where it is desired to store memories as attractors in recurrent neural networks whereby a linear number of component errors in any memory are corrected and the memory retrieved. Rigorous investigations by McEliece et al. [2] and Komlós and Paturi [3] show that the outer-product algorithm for storing memories in a recurrent network of $n$ neurons stores exactly of the order of $n/\log n$ memories when it is required that each memory be retrieved exactly with no component errors. However, in earlier work, Hopfield [4] reports empirical evidence indicating that it may be possible to store a number of memories linear in $n$ with the outer-product algorithm if errors are permitted in the retrieved memories; this was formally verified subsequently by Newman [5], who demonstrated a lower bound linear in $n$ on the memory storage capacity if errors are permitted in retrieval. Thus, an allowed error tolerance effects a substantial improvement in storage capacity from sub-linear in $n$ to at least linear in $n$ for the outer-product algorithm.

Our  purpose  here  is  to  attempt  to  quantify  the  maximal  gains  in  capacity  that  can  accrue  for  any 
algorithm  if  errors  are  allowed  in  the  outputs  of  a  given  neural  network  architecture.  In  particular,  for 
recurrent  neural  networks  of  n  neurons  (and  any  choice  of  algorithm)  we  settle  the  issue  of  whether  Newman’s 
linear  lower  bounds  can  be  substantially  improved  upon  to  allow  memory  storage  capacities  which  increase 
faster than linearly in n if errors are permitted in the recall of the stored memories.

Let us consider the associative memory application in some more detail. Let $\mathbf{u}^1, \ldots, \mathbf{u}^m$ be a random selection of $m$ memories drawn from the vertices of the cube $\{-1, 1\}^n$. For associative memory it is desired that a random probe differing from a memory in no more than $\rho n$ components (for some choice of $0 < \rho < 1/2$) is mapped into the memory. In other words, we require the memories to be attractors over a Hamming ball of radius $\rho n$.

*Some men are born into errors, others achieve errors, and some have errors thrust upon 'em.

†In the classical modeling of McCulloch and Pitts, a formal neuron is a linear threshold element which produces a binary output according to the sign of a linear form of the inputs.

Consider now a recurrent neural network of $n$ neurons where, at any epoch, the binary $n$-tuple of neural outputs is fed back to constitute the neural inputs for the next epoch. The degrees of freedom in this dynamical system reside in the specification of the weighting factors linking neural outputs to inputs. Each specification of interconnection weights results in the specification of a dynamics characterised by a family of trajectories in a state space of the vertices of the cube $\{-1, 1\}^n$. The storage of memories as attractors in such a structure is accomplished if, for a choice of interconnection weights, trajectories originating in a Hamming ball of radius $\rho n$ at the memories ultimately terminate at the memories. As we will see later in Corollary 4.2, it is not possible to store more than a linear number (in $n$) of random memories as attractors in a recurrent neural network of $n$ neurons.

An allowed error tolerance in this context permits a fraction $\epsilon$, $0 < \epsilon < \rho < 1/2$, of errors in the retrieval of any memory. This situation corresponds to requiring points in the Hamming ball of radius $\rho n$ at a memory to be ultimately mapped into a (smaller) ball of radius $\epsilon n$ at the memory. This is indicated schematically in Figure 1. The capacity question is now to determine the largest allowable rate of growth of the number of memories $m$ with the number of neurons $n$ such that it can be guaranteed with high probability that trajectories originating in a Hamming ball of radius $\rho n$ at any memory are ultimately confined within a Hamming ball of radius $\epsilon n$ at the memory. The following plausibility argument indicates that the flexibility inherent in allowing error tolerance may result in substantial gains in capacity over the error-free case.

Let us simplify the problem by considering trajectories originating at a memory, say $\mathbf{u}^\alpha \in \{-1, 1\}^n$. If the trajectory is to be confined within an $\epsilon n$ ball at the memory, then clearly a necessary condition is that the first synchronous step in the trajectory not lead to a point outside the ball of radius $\epsilon n$ at the memory. However, any transition $\mathbf{u}^\alpha \mapsto \mathbf{v}^\alpha$ where $\mathbf{v}^\alpha$ differs from the memory $\mathbf{u}^\alpha$ in no more than $\epsilon n$ components is an admissible transition (see Figure 2). Thus, for the $m$ memories, there are a total of $\bigl(\sum_{j=0}^{\epsilon n} \binom{n}{j}\bigr)^m$ admissible $m$-sets of first synchronous transitions originating at the memories, corresponding to the number of different ways of specifying "first transition points" in the $m$ Hamming balls of radius $\epsilon n$ at the memories. Let us now concentrate on the problem of realising an admissible set of $m$ first transitions (one for each memory) within a recurrent neural network structure. This is clearly a necessary prelude to the larger problem of associative memory with error tolerance in the following sense: if $M(\epsilon, n)$ is the largest rate of growth of $m$ with $n$ for which it can be guaranteed with high probability that there exists a set of neural interconnection weights for a recurrent neural network which yields an admissible $m$-set of first synchronous transitions originating at the memories, then $M(\epsilon, n)$ gives an upper bound for the number of memories that can be stored with an error tolerance of $\epsilon n$ components in retrieval.

Our basic question now is the following: For a random selection of $m$ memories, $\mathbf{u}^\alpha \in \{-1, 1\}^n$, $\alpha = 1, \ldots, m$, and a given (fractional) error tolerance $0 < \epsilon < 1/2$, is there any choice of neural interconnection weights for a recurrent neural network which will result in an admissible $m$-set of first synchronous transitions, $\mathbf{u}^\alpha \mapsto \mathbf{v}^\alpha$, $\alpha = 1, \ldots, m$, where each $\mathbf{v}^\alpha$ differs from the corresponding memory $\mathbf{u}^\alpha$ in no more than $\epsilon n$ components? Now it is easy to verify using Stirling's formula that the number of admissible $m$-sets of first synchronous transitions from the memories is $\Omega(2^{c(\epsilon) n m})$, where $c(\epsilon)$ is a fixed positive function of $\epsilon$. For a large choice of $m$, any given $m$-set of admissible transitions, $\mathbf{u}^1 \mapsto \mathbf{v}^1, \ldots, \mathbf{u}^m \mapsto \mathbf{v}^m$, will have only a small probability of being realised within a recurrent neural network architecture. However, there are an exponential number of possible sets of admissible transitions so that the probability that one or more of these sets of transitions is realisable in a recurrent network may be large even if the probability of realising any individual set of transitions is small. It hence appears that potentially large gains in capacity may be possible if errors are allowed in retrieving memories.
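The exponential count invoked in this plausibility argument is simple to check numerically. The following sketch uses illustrative function names of our own and takes the ball radius to be $\lfloor \epsilon n \rfloor$; it computes the size of a Hamming ball and the base-2 logarithm of the number of admissible $m$-sets of first transitions.

```python
from math import comb, log2


def hamming_ball_size(n: int, radius: int) -> int:
    """Number of vertices of {-1, 1}^n within Hamming distance `radius`."""
    return sum(comb(n, j) for j in range(radius + 1))


def log2_admissible_sets(n: int, m: int, eps: float) -> float:
    """log2 of the number of admissible m-sets of first transitions."""
    return m * log2(hamming_ball_size(n, int(eps * n)))
```

With $n = 1000$ and $\epsilon = 0.1$, the per-memory exponent `log2(hamming_ball_size(1000, 100)) / 1000` lies just below the binary entropy of 0.1 (about 0.469 bits), so the total count does grow like $2^{c(\epsilon) n m}$ with $c(\epsilon) > 0$.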

A similar plausibility argument could be made for the improvement in capacity of any neural network architecture if a fraction $\epsilon$ of the network outputs can be in error. For instance, if we are interested in realising a set of desired assignments $\mathbf{u}^\alpha \mapsto f(\mathbf{u}^\alpha)$, $\alpha = 1, \ldots, m$ (where $f$ is some underlying function from which random examples are drawn) in a feedforward neural architecture, allowing up to $\epsilon m$ of these assignments to be in error again gives an exponential (in $m$) number of choices for specifying incorrect assignments.

The above specious arguments notwithstanding, the main results of this paper indicate that error tolerance does not buy order-of-magnitude improvements in capacity over the error-free case in neural network architectures: in general, error tolerance results in an improvement in the multiplicative constants, but does not change the rate of growth of capacity.

Organisation: In the next section we set up the formal neural model in the framework of a fully-interconnected network architecture for definiteness. We also introduce two protocols for error (a random and an exhaustive error protocol) and define the formal notion of capacity. The definitions here follow those developed in Venkatesh and Psaltis [6] in the analysis of reliability and error tolerance issues for computations with a single neuron. In Section 3 we state some preliminary technical lemmas which are central to the ensuing development. In Section 4 we state and prove the main theorems on the capacity of fully-interconnected, recurrent neural networks when there is a tolerance for output error. Finally, in Section 5 we briefly indicate the extensions of these results to feedforward neural network architectures.

On Notation: $\mathbb{B}$ denotes the set $\{-1, 1\}$; the function $\operatorname{sgn} : \mathbb{R} \to \mathbb{B}$ is defined by $\operatorname{sgn} x = x/|x|$ if $x \ne 0$, with the nonce convention $\operatorname{sgn} 0 = 1$; logarithms are to base $e$; for any positive integer $k$, $[k]$ denotes the set $\{1, \ldots, k\}$; $c_1, c_2, c_3, \ldots$ represent absolute positive constants; for any $\mathbf{u} \in \mathbb{B}^n$ and any $r \ge 0$, $B(\mathbf{u}, r)$ denotes the Hamming ball of radius $r$ at $\mathbf{u}$, i.e., the set of all points in $\mathbb{B}^n$ which differ from $\mathbf{u}$ in $r$ or fewer components; the probability that $N$ Bernoulli trials with probabilities $p$ for success and $1 - p$ for failure result in $k$ successes and $N - k$ failures is denoted by $b(k; N, p)$:

$$b(k; N, p) = \binom{N}{k} p^k (1 - p)^{N - k}.$$

We will also have recourse to the following asymptotic order notations. If $\{x_n\}$ and $\{y_n\}$ are positive sequences, we denote: $x_n = O(y_n)$ if there exists $K < \infty$ such that $x_n/y_n \le K$ for all $n$; $x_n = \Omega(y_n)$ if there exists $L > 0$ such that $x_n/y_n \ge L$ for all $n$; and $x_n = o(y_n)$ if $x_n/y_n \to 0$ as $n \to \infty$.


2  ERROR  TOLERANCE:  PROTOCOLS 

2.1  Formal  Neurons  and  Networks 


A formal neuron is a linear threshold element accepting $n$ inputs which produces a binary output according to the sign of a linear form of the inputs. In particular, a neuron is characterised by a vector of $n$ real weights, $\mathbf{w} = (w_1 \cdots w_n)$, and, given as input a vector $\mathbf{u} = (u_1 \cdots u_n)$, produces a binary output $v = \operatorname{sgn} \sum_{i=1}^n w_i u_i$.‡ The neuron hence associates a decision or classification, -1 or +1, with each point in $n$-space. For any given set of points then, the neuron dichotomises the set of points into two classes: those points mapped to +1 and those mapped to -1. In the geometric analogue the neuron represents a separating hyperplane in $n$-space. Let $U = \{\mathbf{u}^1, \ldots, \mathbf{u}^m\}$ be an $m$-set of points in $n$-space, and, corresponding to each $\mathbf{u}^\alpha \in U$, let $v^\alpha \in \mathbb{B}$ be a desired decision. The $m$-set of decisions naturally dichotomises $U$ into two sets $(U^+, U^-)$, where $\mathbf{u}^\alpha \in U^+$ if $v^\alpha = 1$ and $\mathbf{u}^\alpha \in U^-$ if $v^\alpha = -1$. We say that the dichotomy $(U^+, U^-)$ is separable by a neuron iff there exists a weight vector $\mathbf{w} \in \mathbb{R}^n$ such that

$$\sum_{i=1}^n w_i u_i^\alpha \begin{cases} \ge 0 & \text{if } \mathbf{u}^\alpha = (u_1^\alpha \cdots u_n^\alpha) \in U^+, \\ < 0 & \text{if } \mathbf{u}^\alpha = (u_1^\alpha \cdots u_n^\alpha) \in U^-, \end{cases}$$

i.e., $\operatorname{sgn} \sum_{i=1}^n w_i u_i^\alpha = v^\alpha$ for $\alpha = 1, \ldots, m$.

We will be concerned here with a fully-interconnected, recurrent network of $n$ neurons where the neural outputs at any epoch are fed back to constitute the neural inputs for the next epoch. In particular, for any $i \in [n]$, neuron $i$ is characterised by a set of $n - 1$ weights $\{w_{ij} : j \ne i, j \in [n]\}$, and if the vector $\mathbf{u} = (u_1 \cdots u_n) \in \mathbb{B}^n$ denotes the neural outputs at any epoch, at the next epoch the $i$th neuron then produces a binary output $u_i' = \operatorname{sgn} \sum_{j \ne i} w_{ij} u_j$.§ The system hence evolves (synchronously) in a state space of vertices of the cube $\mathbb{B}^n$, and is completely characterised by the zero-diagonal matrix of neural interconnection weights $[w_{ij}]$.

‡This is the model of McCulloch and Pitts. A real threshold is allowed within the model but is not critical to the present discussion.
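The synchronous dynamics just described can be sketched in a few lines. This is an illustration under stated assumptions rather than a prescribed implementation: the function names are ours, the state is a plain list of ±1 values, and the outer-product rule is used only as a hypothetical way of filling in the zero-diagonal weight matrix.

```python
def sgn(x: float) -> int:
    """Sign function with the convention sgn(0) = +1."""
    return 1 if x >= 0 else -1


def synchronous_step(weights, state):
    """One synchronous update u_i' = sgn(sum_{j != i} w_ij * u_j)."""
    n = len(state)
    return [
        sgn(sum(weights[i][j] * state[j] for j in range(n) if j != i))
        for i in range(n)
    ]


def outer_product_weights(memories):
    """Hypothetical weight choice: zero-diagonal outer-product rule."""
    n = len(memories[0])
    return [
        [0 if i == j else sum(u[i] * u[j] for u in memories) for j in range(n)]
        for i in range(n)
    ]
```

With a single stored memory `u = [1, -1, 1, 1, -1]`, the call `synchronous_step(outer_product_weights([u]), u)` returns `u` itself, so the memory is a fixed point of the dynamics.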

Let $\mathbf{u}^1, \ldots, \mathbf{u}^m$ be a random $m$-set of memories chosen independently from the vertices of the cube. In particular, the memory components $\{u_i^\alpha : i \in [n], \alpha \in [m]\}$ are i.i.d. random variables drawn from a sequence of symmetric Bernoulli trials:

$$P\{u_i^\alpha = 1\} = P\{u_i^\alpha = -1\} = 1/2, \qquad i \in [n],\ \alpha \in [m].$$

In an associative memory application, a basic desideratum would be that all the memories are fixed points of the network, $\mathbf{u}^\alpha \mapsto \mathbf{u}^\alpha$, $\alpha = 1, \ldots, m$, i.e., that there exists a zero-diagonal matrix of weights $[w_{ij}]$ such that

$$\operatorname{sgn} \sum_{\substack{j = 1 \\ j \ne i}}^{n} w_{ij} u_j^\alpha = u_i^\alpha, \qquad i \in [n],\ \alpha \in [m].$$

If now there is an allowed tolerance for error, the fixed point requirement for the memories could be relaxed to allow "admissible" first synchronous transitions of the form $\mathbf{u}^\alpha \mapsto \mathbf{v}^\alpha$. By "admissible" we mean as before that the number of components in which the points $\mathbf{v}^\alpha \in \mathbb{B}^n$ are allowed to differ from the corresponding memories $\mathbf{u}^\alpha$ must be within a given error tolerance. We are interested in estimating the largest allowable rate of growth of $m$ with $n$ for which there exists a zero-diagonal network which realises $m$ admissible synchronous transitions $\mathbf{u}^\alpha \mapsto \mathbf{v}^\alpha$ with high probability as $n$ grows large. In the following we define two formal error protocols which provide different notions of "admissibility."

2.2  Random  Error  Protocol 


We begin by defining a protocol which randomly specifies which components of a memory are allowed to be in error by randomly labeling a set of memory components as don't-cares. Let $0 \le \epsilon < 1/2$ be the fraction of errors that we are willing to allow in the retrieval of any memory. For each $i \in [n]$ let $\{D_i^\alpha : \alpha \in [m]\}$ be the outcomes of $m$ identical and independent experiments whose outcomes are subsets of $\{-1, 1\}$, and such that

$$D_i^\alpha = \begin{cases} \{u_i^\alpha\} & \text{with probability } 1 - 2\epsilon, \\ \{-1, 1\} & \text{with probability } 2\epsilon. \end{cases}$$

§To avoid trivial complications, we assume that there is no self feedback from the output of any neuron to its input. This corresponds to setting the weights $w_{ii} = 0$ for $i = 1, \ldots, n$.

If a sample outcome $D_i^\alpha = \{u_i^\alpha\}$, then we require that the $i$th neuron retrieve the $i$th component of memory $\mathbf{u}^\alpha$; if, however, the sample outcome $D_i^\alpha = \mathbb{B}$, then we associate a don't-care decision with point $\mathbf{u}^\alpha$: the neuron can result in either -1 or 1 as output when $\mathbf{u}^\alpha$ is input. We call $D_i^\alpha$ the decision set associated with memory component $u_i^\alpha$; we say that $D_i^\alpha$ is normal if $D_i^\alpha = \{u_i^\alpha\}$ (i.e., memory component $u_i^\alpha$ has to be retrieved by the $i$th neuron), and $D_i^\alpha$ is exceptional if $D_i^\alpha = \mathbb{B}$ (i.e., the decision is don't-care). The idea behind defining the decision sets in this fashion is the following. For each $\alpha \in [m]$, the decision sets $\{D_i^\alpha : i \in [n]\}$ are generated independently according to the above prescription so that the expected number of normal decision sets is $(1 - 2\epsilon)n$ while the expected number of exceptional decision sets is $2\epsilon n$ for each memory. If now, ignoring components corresponding to exceptional decision sets, we design a zero-diagonal weight matrix to retrieve all components corresponding to the normal decision sets, then on average one-half of the components corresponding to exceptional decision sets will also be retrieved so that the expected number of errors in the retrieval of any memory will be only $\epsilon n$.
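The decision-set construction is easy to simulate. In the sketch below the function names and the use of Python sets to represent decision sets are our own illustrative choices; the sampling rule itself (exceptional with probability $2\epsilon$, normal otherwise) follows the prescription above.

```python
import random


def sample_decision_sets(memory, eps, rng):
    """Draw one decision set per component of a memory."""
    sets = []
    for component in memory:
        if rng.random() < 2.0 * eps:
            sets.append({-1, 1})        # exceptional: don't-care
        else:
            sets.append({component})    # normal: must be retrieved
    return sets


def is_reliable(outputs, decision_sets):
    """True if every output falls in its corresponding decision set."""
    return all(v in d for v, d in zip(outputs, decision_sets))
```

For a long memory the fraction of exceptional sets concentrates near $2\epsilon$, and an output vector that matches the memory on every normal component is reliable however the don't-care components come out.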

Definition 2.1 For each $i \in [n]$, let $\{w_{ij} : j \ne i\}$ be the set of $n - 1$ weights corresponding to the $i$th neuron. We then say that the $i$th neuron makes $\epsilon$-reliable decisions on the set of memories $\{\mathbf{u}^1, \ldots, \mathbf{u}^m\}$ if

$$\operatorname{sgn} \sum_{\substack{j = 1 \\ j \ne i}}^{n} w_{ij} u_j^\alpha \in D_i^\alpha, \qquad \alpha = 1, \ldots, m.$$

If all the neurons make $\epsilon$-reliable decisions, then, by the Borel strong law, the fraction of component errors in the retrieval of any memory is $\epsilon$ almost surely. We are hence interested in the following attribute of the $m$-set of memories.

EVENT $\mathcal{R}_\epsilon(n, m)$ [Random Error Protocol with Parameter $\epsilon$]: For each $i \in [n]$, there is a choice of weights for the $i$th neuron such that the neuron makes $\epsilon$-reliable decisions on the set of memories.

The attribute $\mathcal{R}_\epsilon(n, m)$ deals with the notion of random synchronous transitions from the memories by specifying a random choice of (on average $\epsilon n$) component errors for each memory. The computational gains we may expect from this protocol arise from the large number of "typical" transitions that can be specified.

2.3  Exhaustive  Error  Protocol 

Consider now a protocol where the number of component errors in a memory in a single synchronous transition is not permitted to exceed $\epsilon n$; however, there is no specification or constraint on which components in a memory are allowed to be in error, and we are free to examine all alternatives of specifying $\epsilon n$ or fewer component errors in each memory in one synchronous step. This protocol leads to a consideration of the following attribute of the $m$-set of memories.

EVENT $\mathcal{E}_\epsilon(n, m)$ [Exhaustive Error Protocol with Parameter $\epsilon$]: For each $\alpha \in [m]$ there exists a vertex $\mathbf{v}^\alpha \in B(\mathbf{u}^\alpha, \epsilon n)$ such that the set of synchronous transitions $\{\mathbf{u}^\alpha \mapsto \mathbf{v}^\alpha : \alpha \in [m]\}$ is realised for some choice of zero-diagonal weight matrix $[w_{ij}]$ for a fully-interconnected network of $n$ neurons.

The attribute $\mathcal{E}_\epsilon(n, m)$ is somewhat stronger than the attribute $\mathcal{R}_\epsilon(n, m)$; instead of specifying a random set of essentially $2\epsilon n$ don't-care components for a memory (resulting in essentially $\epsilon n$ component errors), we are now allowed to examine all transitions resulting in $\epsilon n$ or fewer component errors for each memory and choose the most favourable specification of errors. As we saw in the Introduction, this allows us any choice from among an exponentially large number of admissible $m$-sets of first synchronous transitions originating at the memories.

2.4  Capacity  Functions 

The notion of capacity of a fully-interconnected neural network that we espouse is, loosely speaking, the "largest number" of random memories that can be "stored" in the network. The precise meaning we attach to "storage" of a memory depends upon the attribute of interest, such as: all memories are fixed points; almost all memories are attractors over a radius $\rho n$; there are no more than $\epsilon n$ component errors in retrieving any memory. We will be interested in particular in the attributes $\mathcal{R}_\epsilon(n, m)$, the random error protocol with parameter $\epsilon$, and $\mathcal{E}_\epsilon(n, m)$, the exhaustive error protocol with parameter $\epsilon$. The following definition captures the notion of "largest number of memories" as a threshold function of a relevant attribute. The notion is explored in somewhat greater generality in [7].

Definition 2.2 Let $\mathcal{A}(n, m)$ be an attribute of the $m$-set of memories $\mathbf{u}^1, \ldots, \mathbf{u}^m$. A sequence $\{C(n)\}_{n=1}^{\infty}$ is a capacity function for the attribute $\mathcal{A}(n, m)$ if for $\lambda > 0$ arbitrarily small, as $n \to \infty$:

a) $P\{\mathcal{A}(n, m)\} \to 1$ whenever $m \le (1 - \lambda)C(n)$;

b) $P\{\mathcal{A}(n, m)\} \to 0$ whenever $m \ge (1 + \lambda)C(n)$.

We say that $C(n)$ is a lower capacity function if it satisfies the first condition, and that $C(n)$ is an upper capacity function if it satisfies the second condition.

Capacity functions have been found for a variety of neural network architectures and algorithms (a survey can be found in Venkatesh [7]). These investigations into network capacity, however, have hitherto concentrated mainly on capacity functions for perfect decisions with no errors (cf. [3, 5], however, for results on error tolerance in the outer-product algorithm). In the following we expand on our results in [1, 6] and show capacity functions for the attributes $\mathcal{R}_\epsilon(n, m)$ and $\mathcal{E}_\epsilon(n, m)$.

3  TECHNICAL  PRELIMINARIES 

Our basic technique is to replace the geometrical notion of trajectories within Hamming balls in $n$-space by calculations involving the tails of binomial distributions. The following is a classical result due to Chernoff [8] which asserts exponential bounds for the binomial tails for linear deviations from the mean.

Lemma 3.1 Let $0 < p < 1$ be fixed, and let $T_p$ and $H$ be real-valued functions on the closed interval $[0, 1]$ defined for $0 \le c \le 1$ by

$$T_p(c) = -c \log p - (1 - c) \log(1 - p),$$
$$H(c) = -c \log c - (1 - c) \log(1 - c).^{¶}$$

Then for every choice of $c \in (p, 1]$ and every integer $N$ we have

$$\sum_{k=0}^{\lfloor cN \rfloor} b(k; N, p) \ge 1 - e^{-N[T_p(c) - H(c)]}.$$

Remarks: $H$ is the binary entropy function which takes values in $[0, \log 2]$. Note that for any choices of $c$ and $p$, Jensen's inequality yields

$$H(c) - T_p(c) = c \log \frac{p}{c} + (1 - c) \log \frac{1 - p}{1 - c} \le \log 1 = 0$$

with equality holding only when $c = p$. Hence $T_p(c) > H(c)$ whenever $c \ne p$. Chernoff's bound hence yields exponentially small probabilities for the extreme tails of the binomial distribution. This bound can be shown to be exponentially tight (see Blake and Darabian [9], for instance).

For the special case $p = 1/2$, Chernoff's bound yields

$$\sum_{k=0}^{\lfloor cN \rfloor} b(k; N, 0.5) \ge 1 - e^{-N[\log 2 - H(c)]}$$

for any choice $1/2 < c \le 1$.
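Chernoff's bound can be compared against the exact binomial distribution for moderate $N$. The sketch below is a numerical illustration only; the function names are ours, natural logarithms are used in keeping with the notation of this paper, and the bound is evaluated for $p < c < 1$.

```python
import math


def binomial_cdf(limit: int, trials: int, p: float) -> float:
    """Exact sum_{k=0}^{limit} b(k; trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1.0 - p) ** (trials - k)
        for k in range(limit + 1)
    )


def chernoff_lower_bound(c: float, trials: int, p: float) -> float:
    """1 - exp(-N [T_p(c) - H(c)]) with natural logarithms, for p < c < 1."""
    t = -c * math.log(p) - (1.0 - c) * math.log(1.0 - p)
    h = -c * math.log(c) - (1.0 - c) * math.log(1.0 - c)
    return 1.0 - math.exp(-trials * (t - h))
```

For example, with $N = 50$, $p = 0.3$ and $c = 0.5$, `binomial_cdf(25, 50, 0.3)` exceeds `chernoff_lower_bound(0.5, 50, 0.3)`, as the lemma requires.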


¶We define $H(c) = 0$ when $c = 0$ or $c = 1$.

For any $m$-set of points $U \subset \mathbb{R}^N$, $|U| = m$, let $\mathcal{D}(U)$ denote the family of dichotomies of $U$ that can be separated by a neuron; a dichotomy $(U^+, U^-)$ of $U$ is in $\mathcal{D}(U)$ if and only if there exists a weight vector $\mathbf{w} = (w_1 \cdots w_N) \in \mathbb{R}^N$ such that

$$\sum_{i=1}^N w_i u_i \begin{cases} \ge 0 & \text{if } \mathbf{u} = (u_1 \cdots u_N) \in U^+, \\ < 0 & \text{if } \mathbf{u} = (u_1 \cdots u_N) \in U^-. \end{cases}$$

The following estimate for the number of dichotomies separable by a neuron was given by Schläfli [10] using an elegant combinatorial argument. (For more accessible proofs of the result see Wendel [11] or Cover [12].)

Lemma 3.2 Let $U \subset \mathbb{R}^N$ be an $m$-set of points. The following estimate holds:

$$|\mathcal{D}(U)| \le D(N, m) = 2 \sum_{i=0}^{N-1} \binom{m - 1}{i}.$$

Furthermore, the upper bound of $D(N, m)$ is achieved for $m$-sets of points in general position.‖
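The counting function in Lemma 3.2 is straightforward to tabulate. The short sketch below (the function name is ours) evaluates $D(N, m)$ exactly with integer arithmetic.

```python
from math import comb


def schlafli_count(dim: int, m: int) -> int:
    """D(N, m) = 2 * sum_{i=0}^{N-1} C(m-1, i) for m >= 1 points in N-space."""
    return 2 * sum(comb(m - 1, i) for i in range(dim))
```

Two sanity checks follow directly from the formula: for $m \le N$ every one of the $2^m$ dichotomies is counted, and for $m = 2N$ exactly half of them are, since $D(N, 2N) = 2^{2N - 1}$.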


A fundamental parameter of interest to us is the probability that a neuron can separate a random dichotomy of a random set of vertices of the cube $\mathbb{B}^N$. Let $\mathbf{u}^1, \ldots, \mathbf{u}^m$ be a randomly drawn set of patterns from the vertices of the cube $\mathbb{B}^N$, and let the pattern components $\{u_i^\alpha : i \in [N], \alpha \in [m]\}$ form a sequence of symmetric Bernoulli trials:

$$P\{u_i^\alpha = -1\} = P\{u_i^\alpha = +1\} = 1/2, \qquad i \in [N],\ \alpha \in [m].$$

To each pattern $\mathbf{u}^\alpha$ associate a desired classification $v^\alpha \in \mathbb{B}$ specified independently of $\mathbf{u}^\alpha$. We are interested in the probability $P(N, m)$ that there exists a choice of weight vector $\mathbf{w} \in \mathbb{R}^N$ such that

$$\operatorname{sgn} \sum_{i=1}^N w_i u_i^\alpha = v^\alpha, \qquad \alpha = 1, \ldots, m.$$

The following asymptotic estimate for $P(N, m)$ was shown by Füredi [13].

Lemma 3.3 If $m = O(N)$ as $N \to \infty$ then

$$P(N, m) = \sum_{j=0}^{N-1} b(j; m - 1, 0.5) - O(e^{-c_1 N}).$$

Remarks: Note that Lemma 3.2 guarantees that $2^{-m} D(N, m)$ is an upper bound for $P(N, m)$. The asymptotic order correction to this estimate in Lemma 3.3 arises because the probability that a random $m$-set of vertices from $\mathbb{B}^N$ is in general position rapidly becomes small when $m$ exceeds $N$. The exponentially small order term quoted above is a strengthening of Füredi's original estimate of $O(N^{-1/2})$. The refinement was made possible by a new result due to Kahn, Komlós, and Szemerédi [14] which asserts that the probability that a random $N \times N$ ±1 matrix is singular decreases exponentially fast with $N$. Incorporating this result in Füredi's proof (without any other change) gives the quoted improvement. We will need the stronger form of the result.

‖An $m$-set of points in $N$-space is in general position iff any subset of $N$ or fewer of the points is linearly independent.

A direct application of Chernoff's bound to the above estimate for $P(N,m)$ yields the following

Lemma 3.4 For every fixed $\lambda > 0$ we can find absolute positive constants $c_1$ and $c_2$ such that as $N \to \infty$:

$$P(N, 2N(1-\lambda)) = 1 - O(e^{-c_1 N}),$$

$$P(N, 2N(1+\lambda)) = O(e^{-c_2 N}).$$

This is the well known result that a formal neuron can separate a random dichotomy of up to 2N patterns. Using elementary binomial identities, it is easy to verify the following “monotone property.”

Lemma 3.5 If $k = O(N)$ as $N \to \infty$ then

$$P(N,k) - P(N,k+1) - 2^{-k}\binom{k-1}{N-1} = O(e^{-cN}).$$

In particular, $P(N,k)$ is a monotone non-increasing function of k in the vicinity of 2N for large enough N. Note that when $k = (2-\delta)N$ or $k = (2+\delta)N$ then Stirling's formula gives $|P(N,k) - P(N,k+1)| = O(e^{-cN})$.
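The threshold at 2N and the monotone property can be illustrated numerically. The sketch below evaluates only the leading binomial term of Lemma 3.3, $\sum_{j=0}^{N-1} b(j; m-1, 1/2)$, and ignores the exponentially small correction, so it is an approximation to $P(N,m)$ rather than the exact probability. The helper names `p_leading` and `monotone_gap` are illustrative.

```python
from fractions import Fraction
from math import comb


def p_leading(N: int, m: int) -> Fraction:
    """Leading term of Lemma 3.3: sum_{j=0}^{N-1} b(j; m-1, 1/2)."""
    if m <= N:
        # All m-1 trials fit below the cutoff N-1, so the sum is exactly 1.
        return Fraction(1)
    return Fraction(sum(comb(m - 1, j) for j in range(N)), 2 ** (m - 1))


def monotone_gap(N: int, k: int) -> Fraction:
    """2^{-k} * C(k-1, N-1), the difference term appearing in Lemma 3.5."""
    return Fraction(comb(k - 1, N - 1), 2 ** k)


if __name__ == "__main__":
    N = 50
    for m in (75, 100, 125):
        # Below, at, and above the 2N threshold.
        print(m, float(p_leading(N, m)))
    k = 2 * N
    # For the leading terms the identity of Lemma 3.5 holds exactly.
    print(p_leading(N, k) - p_leading(N, k + 1) == monotone_gap(N, k))
```

Exact rational arithmetic is used so that the binomial identity can be checked without rounding error; for large N a floating-point version would be faster.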

4  ERROR  TOLERANCE:  CAPACITIES 

4.1  Random  Error  Protocol 

We first consider the computational gains that are feasible under a random error protocol with parameter $\epsilon \in [0, 1/2)$. For any fixed $i \in [n]$, let $\{D_i^\alpha : \alpha \in [m]\}$ be the sequence of decision sets corresponding to neuron i. Recall that the decision sets are drawn independently according to a sequence of Bernoulli trials, and

$$D_i^\alpha = \begin{cases} \{u_i^\alpha\} & \text{with probability } 1 - 2\epsilon, \\ \{-1, 1\} & \text{with probability } 2\epsilon. \end{cases}$$

Theorem 4.1 For any fixed error parameter $0 \le \epsilon < 1/2$, the sequence $2n/(1-2\epsilon)$ is a capacity function for $\mathcal{R}_\epsilon(n,m)$, the random error protocol with parameter $\epsilon$.
Proof: The ith neuron makes $\epsilon$-reliable decisions if there is a choice of n - 1 weights $\{w_{ij} : j \in [n] \setminus \{i\}\}$ such that

$$\operatorname{sgn}\Bigl(\sum_{j \ne i} w_{ij} u_j^\alpha\Bigr) \in D_i^\alpha, \qquad \alpha = 1, \ldots, m.$$

Alternatively, let $A = \{\alpha : D_i^\alpha \text{ is normal}\}$ be the (random) set of indices identifying memories whose ith component must be retrieved. The above is then equivalent to requiring that

$$\operatorname{sgn}\Bigl(\sum_{j \ne i} w_{ij} u_j^\alpha\Bigr) = u_i^\alpha, \qquad \alpha \in A.$$

Note that as a consequence of the zero-diagonal nature of the network, the term $u_i^\alpha$ is absent in the sum above. By the independence of the memory components, if $|A| = k$ then the above is equivalent to finding a weight vector in (n - 1)-space which separates a randomly and independently specified dichotomy of a set of k vertices chosen randomly from $\mathbb{B}^{n-1}$. It is hence clear that $P(n-1,k)$ is the probability that the ith neuron makes $\epsilon$-reliable decisions conditioned upon there being k normal decision sets and m - k exceptional decision sets. As the distribution of normal and exceptional decision sets is governed by the binomial distribution it follows that the probability $P_\epsilon(n,m)$ that the ith neuron makes $\epsilon$-reliable decisions is given by

$$P_\epsilon(n,m) = \sum_{k=0}^{m} b(k;\, m,\, 1-2\epsilon)\, P(n-1,k). \tag{1}$$

By Boole's inequality, the probability $1 - P\{\mathcal{R}_\epsilon(n,m)\}$ that one or more neurons fails to make $\epsilon$-reliable decisions is bounded by

$$1 - P\{\mathcal{R}_\epsilon(n,m)\} \le n\,[1 - P_\epsilon(n,m)].$$

Also, the probability $P\{\mathcal{R}_\epsilon(n,m)\}$ that all the neurons make $\epsilon$-reliable decisions is clearly bounded above by the probability $P_\epsilon(n,m)$ that a single neuron makes $\epsilon$-reliable decisions. Combining this with the above inequality we have the two-sided bounds:

$$1 - n[1 - P_\epsilon(n,m)] \le P\{\mathcal{R}_\epsilon(n,m)\} \le P_\epsilon(n,m). \tag{2}$$

Now let $0 < \lambda < 2\epsilon$ be fixed but arbitrary. With the choice

$$m = (1-\lambda)\,\frac{2n}{1-2\epsilon} \tag{3}$$

we have

$$P_\epsilon(n,m) \ge \sum_{k=0}^{m(1-2\epsilon)(1+\lambda/2)} b(k;\, m,\, 1-2\epsilon)\, P(n-1,k) = 1 - O(e^{-cn}) \qquad (n \to \infty)$$

for some constant $c > 0$.

The first inequality is obvious as it arises from the deletion of terms in the sum in (1). This lower bound for $P_\epsilon(n,m)$ can be seen to approach 1 exponentially fast as asserted above by two appeals to Lemma 3.1; with m increasing with n as in (3), $P(n-1,k) = 1 - O(e^{-cn})$ for k in the range $0 \le k \le m(1-2\epsilon)(1+\lambda/2)$; further, $\sum_{k=0}^{m(1-2\epsilon)(1+\lambda/2)} b(k;\, m,\, 1-2\epsilon) = 1 - O(e^{-cn})$. It follows from the lower bound of (2) that

$$P\{\mathcal{R}_\epsilon(n,m)\} = 1 - O(n e^{-cn}) \qquad (n \to \infty)$$

for m growing as in (3). As $\lambda$ is arbitrary, $2n/(1-2\epsilon)$ is a lower capacity function for $\mathcal{R}_\epsilon(n,m)$.

Now choose m growing with n such that

$$m = (1+\lambda)\,\frac{2n}{1-2\epsilon}. \tag{4}$$

We then have

$$P_\epsilon(n,m) \le \sum_{k=0}^{m(1-2\epsilon)(1-\lambda/2)} b(k;\, m,\, 1-2\epsilon) + P\bigl(n-1,\ m(1-2\epsilon)(1-\lambda/2)\bigr) + O(e^{-cn}) = O(e^{-cn}) \qquad (n \to \infty).$$

The first inequality follows from Lemma 3.5 and elementary considerations; the exponential decrease of the upper bound to zero is readily ascertained by applying Lemma 3.1 twice, as before. The upper bound of (2) hence yields

$$P\{\mathcal{R}_\epsilon(n,m)\} = O(e^{-cn}) \qquad (n \to \infty)$$

for m growing as in (4). As $\lambda$ is arbitrary, $2n/(1-2\epsilon)$ is also an upper capacity function, hence a capacity function, for $\mathcal{R}_\epsilon(n,m)$. ∎

The case where each memory is required to be a fixed point of the network corresponds to the choice of error parameter $\epsilon = 0$. The following conclusion is hence immediate:

Corollary 4.2 The sequence 2n is a capacity function for the attribute $\mathcal{R}_0(n,m)$ that all the memories are fixed points of the network.

This fixed point capacity of 2n was also demonstrated by Venkatesh and Baldi [15] in the analysis of fixed points of higher order neural networks. Recall the classical result restated in Lemma 3.4 that 2n is a capacity function for a single neuron. (The relevant attribute here is the separation of a random dichotomy of a set of points (memories) in n-space by a neuron.) The corollary above asserts that there is no decrease in capacity for a zero-diagonal network of n neurons even though we now require n dichotomies of the set of memories to be simultaneously separated. (As seen in the proof of the theorem, neuron i, for instance, is required to dichotomise the set of memories according to the set of signs $\{u_i^\alpha : \alpha \in [m]\}$.)

Theorem 4.1 hence asserts that if the fixed point requirement on the memories is relaxed and it is only required that, starting at any memory, a synchronous state transition results in a new state no more than $\epsilon n$ bits away from the memory on average, then the capacity increases by a constant multiplicative factor of $1/(1-2\epsilon)$. Note, however, that the rate of increase of the capacity function remains linear in n and is not improved in the random error protocol.
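The shift of the threshold from 2n to $2n/(1-2\epsilon)$ can be seen numerically from the mixture formula $P_\epsilon(n,m) = \sum_k b(k; m, 1-2\epsilon)\,P(n-1,k)$. The sketch below is a rough illustration under one stated assumption: $P(n-1,k)$ is replaced by its leading binomial term from Lemma 3.3, with the exponentially small correction dropped. The names `p_single`, `binom_pmf` and `p_reliable` are illustrative.

```python
from functools import lru_cache
from math import comb, exp, lgamma, log


@lru_cache(maxsize=None)
def p_single(N: int, k: int) -> float:
    """Leading-order estimate of P(N, k): sum_{j<N} b(j; k-1, 1/2)."""
    if k <= N:
        return 1.0
    return sum(comb(k - 1, j) for j in range(N)) / 2 ** (k - 1)


def binom_pmf(k: int, m: int, p: float) -> float:
    """b(k; m, p), evaluated in log-space to avoid overflow for large m."""
    log_pmf = (lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)
               + k * log(p) + (m - k) * log(1.0 - p))
    return exp(log_pmf)


def p_reliable(n: int, m: int, eps: float) -> float:
    """Single-neuron reliability: sum_k b(k; m, 1-2*eps) * P(n-1, k)."""
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch assumes 0 < eps < 1/2")
    p = 1.0 - 2.0 * eps
    return sum(binom_pmf(k, m, p) * p_single(n - 1, k) for k in range(m + 1))


if __name__ == "__main__":
    n, eps = 101, 0.25          # threshold 2n/(1-2*eps) is about 4n = 404
    for m in (300, 400, 500):
        print(m, round(p_reliable(n, m, eps), 4))
```

With these illustrative values the single-neuron probability is close to one well below 4n memories and close to zero well above it, in line with the capacity function stated in Theorem 4.1.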

Theorem 4.1 remains true if we are interested in hetero-associative maps $\mathbf{u}^\alpha \mapsto \mathbf{v}^\alpha$ rather than the auto-associative maps $\mathbf{u}^\alpha \mapsto \mathbf{u}^\alpha$ that we have hitherto considered. In particular, for $\alpha = 1, \ldots, m$ let the associated memories $\mathbf{v}^\alpha$ be chosen independently from $\mathbb{B}^n$ and with components drawn from a sequence of symmetric Bernoulli trials (independent of the memories $\mathbf{u}^\alpha$). The decision sets $D_i^\alpha$ are now specified in natural fashion by a sequence of Bernoulli trials with

$$D_i^\alpha = \begin{cases} \{v_i^\alpha\} & \text{with probability } 1 - 2\epsilon, \\ \{-1, 1\} & \text{with probability } 2\epsilon. \end{cases}$$

It is easy to see that the proof of the theorem carries over in toto for this hetero-associative case. A more direct proof can be crafted, however, with the observation that the independence of the components of the associated memories $\mathbf{v}^\alpha$ yields

$$P\{\mathcal{R}_\epsilon(n,m)\} = P_\epsilon(n,m)^n.$$

Lemma 3.1 now readily yields that $2n/(1-2\epsilon)$ is both a lower and an upper capacity function for $\mathcal{R}_\epsilon(n,m)$.**


4.2  Exhaustive  Error  Protocol 


For $\epsilon$ close to 1/2 the multiplicative improvement of $1/(1-2\epsilon)$ to the capacity that arises for the random error protocol can become quite large; the gains may nonetheless be perceived as unsatisfactory as there is no improvement in the rate of increase of capacity with n. The exhaustive error protocol would seem to confer even greater flexibility in the choice of errors (one is allowed in principle to examine every admissible configuration of errors before selecting the most favourable configuration); as argued heuristically in the Introduction, this might augur well for a large improvement in capacity. We show in this section, however, that while there is a further improvement in the multiplicative constant, the capacity function for the exhaustive error protocol is still linear in n.

As  a  first  step  let  us  show  that,  in  accordance  with  intuition,  the  exhaustive  error  protocol  attains 
capacities  at  least  as  large  as  those  of  the  random  error  protocol. 

**We had presented these results without proof for the hetero-associative case in [1]. There we had assumed in addition that the memories $\mathbf{u}^\alpha$ were drawn from an absolutely continuous distribution in Euclidean n-space $\mathbb{R}^n$, and not from the vertices of the cube $\mathbb{B}^n$ as is more natural in the recurrent network context. The increased dependency structure in the problem makes the case of auto-association with memories drawn from $\mathbb{B}^n$ somewhat harder technically; the improvement to Füredi's lemma quoted in the text is necessary here.
Theorem 4.3 For any fixed error parameter $0 \le \epsilon < 1/2$, the sequence $2n/(1-2\epsilon)$ is a lower capacity function for $\mathcal{E}_\epsilon(n,m)$, the exhaustive error protocol with parameter $\epsilon$.

Proof: If $\epsilon = 0$ there is nothing to prove. Let us hence assume $\epsilon > 0$. Now for any choice $0 < \epsilon' < \epsilon$, the sequence $2n/(1-2\epsilon')$ is a capacity function for the random error protocol with parameter $\epsilon'$. Invoking Lemma 3.1, within capacity the number of errors in each memory that result in the random error protocol is no more than $(\epsilon' + o(1))n$ with probability approaching one as $n \to \infty$. For any $\epsilon' < \epsilon$ the number of errors in each memory is hence less than $\epsilon n$ with probability one. Consequently, $2n/(1-2\epsilon')$ is a lower capacity function for the exhaustive error protocol with parameter $\epsilon$. As $\epsilon' < \epsilon$ can be chosen arbitrarily close to $\epsilon$, it follows from the definition of capacity that $2n/(1-2\epsilon)$ is a lower capacity for the exhaustive error protocol with parameter $\epsilon$. ∎

Theorem 4.4 Let $\kappa_\epsilon$ be a function of the error tolerance $\epsilon$ defined by the unique solution of

$$H\Bigl(\frac{1-2\epsilon}{2\kappa_\epsilon}\Bigr) + H(\epsilon) = \log 2, \qquad 0 \le \epsilon < \tfrac{1}{2}, \tag{5}$$

where H is the binary entropy function. Then the sequence $2\kappa_\epsilon n/(1-2\epsilon)$ is an upper capacity function for the exhaustive error protocol with parameter $\epsilon$.

Remark: The function $\kappa_\epsilon$ is defined as we vary $\epsilon$ in the interval $0 \le \epsilon < 1/2$, and monotonically increases from a value of +1 at $\epsilon = 0$ to a value close to 505 as $\epsilon$ approaches 1/2. For small $\epsilon$ it remains close to +1, so that the capacity function for the attribute $\mathcal{E}_\epsilon(n,m)$ still behaves like $2n/(1-2\epsilon)$.

Proof: Assume that a particular choice of weights for the zero-diagonal network of neurons results in the synchronous transitions $\mathbf{u}^\alpha \mapsto \mathbf{v}^\alpha$, $\alpha = 1, \ldots, m$. Recall that the ith neuron makes a decision error on memory $\mathbf{u}^\alpha$ if $v_i^\alpha \ne u_i^\alpha$, i.e., the ith component of memory $\mathbf{u}^\alpha$ is not retrieved. The key observation here is the following: if each $\mathbf{v}^\alpha$ differs from the corresponding memory $\mathbf{u}^\alpha$ in no more than $\epsilon n$ components, then there exists at least one neuron which makes $\epsilon m$ or fewer decision errors. In fact, if there is no neuron which makes $\epsilon m$ or fewer decision errors, then the total number of component errors after one synchronous step summed over all the m memories will exceed $\epsilon m n$ so that there has to be at least one memory $\mathbf{u}^\alpha$ which is mapped to a point $\mathbf{v}^\alpha$ which is at a Hamming distance larger than $\epsilon n$ from $\mathbf{u}^\alpha$. But this contradicts the earlier assumption about the points $\mathbf{v}^\alpha$.
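The counting argument in this paragraph can be checked mechanically on a table of errors. The sketch below is an illustration only: it builds a hypothetical m-by-n table in which each memory (row) has at most $\epsilon n$ wrong components, then confirms that some neuron (column) has at most $\epsilon m$ errors. The function names and the random test data are assumptions made for the example.

```python
import random


def min_neuron_errors(errors):
    """Smallest number of decision errors made by any single neuron.

    errors[a][i] is True when neuron i fails to retrieve component i
    of memory a.
    """
    n = len(errors[0])
    return min(sum(row[i] for row in errors) for i in range(n))


def random_error_pattern(m, n, eps, rng):
    """Hypothetical data: every memory gets at most floor(eps * n) errors."""
    budget = int(eps * n)
    pattern = []
    for _ in range(m):
        wrong = set(rng.sample(range(n), rng.randint(0, budget)))
        pattern.append([i in wrong for i in range(n)])
    return pattern


if __name__ == "__main__":
    rng = random.Random(0)
    m, n, eps = 40, 25, 0.2
    errors = random_error_pattern(m, n, eps, rng)
    # Row constraint: no memory is more than eps*n bits away.
    print(all(sum(row) <= eps * n for row in errors))
    # Column conclusion: some neuron makes at most eps*m errors.
    print(min_neuron_errors(errors) <= eps * m)
```

The second check cannot fail when the first holds, because the column minimum never exceeds the column average, which is at most $\epsilon m$.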

Now, for any selection of values $v_i^\alpha \in \mathbb{B}$, $\alpha = 1, \ldots, m$, the probability that the ith neuron can realise the maps $\mathbf{u}^\alpha \mapsto v_i^\alpha$ is exactly $P(n-1,m)$. The number of ways in which $\epsilon m$ or fewer decision errors (i.e., the set $\{\alpha : v_i^\alpha \ne u_i^\alpha\}$) can occur is $\sum_{k=0}^{\epsilon m} \binom{m}{k}$. Combining Boole's inequality with the observation above, we hence have the upper bound

$$P\{\mathcal{E}_\epsilon(n,m)\} \le n\, P(n-1,m) \sum_{k=0}^{\epsilon m} \binom{m}{k}.$$

Let $\lambda > 0$ be fixed, but arbitrary, and choose $m = 2\kappa_\epsilon n(1+\lambda)/(1-2\epsilon)$, where $\kappa_\epsilon$ is as defined in (5). Two applications of Lemma 3.1 yield

$$P(n-1,m) \le e^{-(m-1)[\log 2 - H\{(1-2\epsilon)/2\kappa_\epsilon(1+\lambda)\}]}$$

and

$$\sum_{k=0}^{\epsilon m} \binom{m}{k} \le e^{m H(\epsilon)}$$

for  large  enough  n.  Hence,  for  each  choice  of  0  <  c  <  1/2  and  A  >  0,  there  is  a  choice  of  0  >  0  such  that 

The binary entropy function $H(\epsilon)$ increases monotonically from a value of 0 at $\epsilon = 0$ to a value of $\log 2$ at $\epsilon = 1/2$. Hence, with $\kappa_\epsilon$ as in (5), $H\{(1-2\epsilon)/2\kappa_\epsilon(1+\lambda)\} + H(\epsilon) < \log 2$. Hence, for every choice of $\lambda > 0$ and $m = 2\kappa_\epsilon n(1+\lambda)/(1-2\epsilon)$ there is a choice of $\delta > 0$ such that $P\{\mathcal{E}_\epsilon(n,m)\} \le e^{-\delta n} \to 0$ as $n \to \infty$. ∎

Allowing an error tolerance of up to $\epsilon n$ bits in the recall of any memory in an associative memory application corresponds to the requirement that state transitions be confined to the ball of radius $\epsilon n$ at a memory once a transition leads within the ball. The exhaustive error protocol allows any synchronous transition from a memory that does not leave the Hamming ball of radius $\epsilon n$ at the memory. It is clear that this is a necessary condition that must be satisfied if we require confinement of trajectories within balls of radius $\epsilon n$ at the memories. Consequently, Theorem 4.4 implies the following result: If memory components are drawn from a sequence of symmetric Bernoulli trials, then no algorithm for storing memories in a recurrent neural network can achieve a capacity which increases faster than linearly with n; in particular, if a linear number of errors $\epsilon n$ is permitted in the recall of any memory, then $1010n/(1-2\epsilon)$ is an upper capacity function for any algorithm.

5  EXTENSIONS 

The  error  protocols  that  we  had  defined  in  Section  2  for  fully-interconnected  networks  can  be  readily  extended 
to  arbitrary  network  architectures.  We  briefly  derive  here  certain  bounds  on  the  capacity  of  feedforward 
neural  networks  when  there  is  an  allowed  tolerance  to  output  error. 
An L-layer feedforward neural network is comprised of L ordered subcollections of neurons called layers with interconnections specified as follows: for $l = 2, \ldots, L$ the inputs to the lth layer are obtained from the outputs of the (l - 1)th layer. The inputs to the first layer are the network inputs, and the outputs of the Lth layer are the network outputs. Let $\mathbf{n} = (n_0\ n_1\ \cdots\ n_{l+1})$ denote a vector of positive integers. We use the nonce terminology $\mathbf{n}$-network to mean an (l + 1)-layer feedforward neural network which has $n_0 = n$ inputs and whose ith layer contains $n_i$ neurons. For simplicity we will restrict ourselves to the case of a single output neuron, $n_{l+1} = 1$. An $\mathbf{n}$-network then realises a Boolean function of $n_0 = n$ inputs. The error protocols are readily extended to this case.

Let $\mathbf{u}^1, \ldots, \mathbf{u}^m$ be any set of points from Euclidean n-space. To each point $\mathbf{u}^\alpha$ we assign a desired classification $v^\alpha \in \mathbb{B}$. We assume that the set of desired classifications $\{v^1, \ldots, v^m\}$ are drawn from a sequence of symmetric Bernoulli trials.

Analogously with the fully-interconnected case, in the random error protocol we independently assign decision sets $D^\alpha$ to each classification $v^\alpha$ with

$$D^\alpha = \begin{cases} \{v^\alpha\} & \text{with probability } 1 - 2\epsilon, \\ \mathbb{B} & \text{with probability } 2\epsilon. \end{cases}$$

For a particular assignment of weights to the $\mathbf{n}$-network we say that the network makes $\epsilon$-reliable decisions on the set of points $\{\mathbf{u}^\alpha : \alpha \in [m]\}$ if the network output on $\mathbf{u}^\alpha$ lies in $D^\alpha$ for each $\alpha \in [m]$. The attribute of interest is

EVENT $\mathcal{R}_\epsilon(\mathbf{n}, m)$: There is a choice of weights such that the $\mathbf{n}$-network makes $\epsilon$-reliable decisions on the m-set of points $\{\mathbf{u}^\alpha : \alpha \in [m]\}$.

A completely analogous development leads to the following attribute for the exhaustive error protocol.

EVENT $\mathcal{E}_\epsilon(\mathbf{n}, m)$: There is a choice of weights such that the $\mathbf{n}$-network makes no more than $\epsilon m$ classification errors on the m-set of points $\{\mathbf{u}^\alpha : \alpha \in [m]\}$.

The notion of capacity is defined as before as a threshold function of an attribute as the input dimension n becomes large. (Note that we tacitly assume a family of feedforward network architectures with the number of elements in each layer $n_i$ a function of n.)

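Now let $D(\mathbf{n}, m)$ denote the number of dichotomies of $\{\mathbf{u}^\alpha : \alpha \in [m]\}$ that can be separated by an $\mathbf{n}$-network. The following simple overestimate for $D(\mathbf{n},m)$ is obtained by applying Lemma 3.2.

$$D(\mathbf{n},m) \le \prod_{i=0}^{l} D(n_i,m)^{n_{i+1}}.$$

As the classifications are symmetric Bernoulli it follows that the probability $P(\mathbf{n}, m)$ that a random dichotomy can be separated by the network can be bounded by

$$P(\mathbf{n},m) = 2^{-m} D(\mathbf{n},m) \le 2^{-m} \prod_{i=0}^{l} D(n_i,m)^{n_{i+1}} = \exp\Bigl\{-m \log 2 + \sum_{i=0}^{l} n_{i+1} \log D(n_i,m)\Bigr\}.$$

Using the easy bound $D(n_i, m) \le m^{n_i}$ we get

$$P(\mathbf{n},m) \le \exp\Bigl\{-m \log 2 + \Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) \log m\Bigr\}.$$

Define the function $C_1(\mathbf{n})$ by

$$C_1(\mathbf{n}) \log 2 = \Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) \log C_1(\mathbf{n}).$$

Note that

$$C_1(\mathbf{n}) = \frac{1}{\log 2} \Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) \log\Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) [1 + o(1)].$$

It is clear that for any $\lambda > 0$, if $m \ge (1+\lambda) C_1(\mathbf{n})$ then $P(\mathbf{n},m) \to 0$ as $n \to \infty$. As before, we have

$$P\{\mathcal{R}_\epsilon(\mathbf{n},m)\} = \sum_{k=0}^{m} b(k;\, m,\, 1-2\epsilon)\, P(\mathbf{n},k).$$

The same argument in the proof of Theorem 4.3 continues to work, so that we have proved the following

Theorem 5.1 The sequence $C_1(\mathbf{n})/(1-2\epsilon)$ is an upper capacity function for the attribute $\mathcal{R}_\epsilon(\mathbf{n},m)$.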

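For the exhaustive error protocol, we have likewise

$$P\{\mathcal{E}_\epsilon(\mathbf{n},m)\} \le P(\mathbf{n},m) \sum_{k=0}^{\epsilon m} \binom{m}{k} \le \exp\Bigl\{-m \log 2 + \Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) \log m + m H(\epsilon)\Bigr\}.$$

For a small enough choice of error parameter $\epsilon$ let $C_2(\epsilon; \mathbf{n})$ satisfy

$$C_2(\epsilon;\mathbf{n})\,[\log 2 - H(\epsilon)] = \Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) \log C_2(\epsilon;\mathbf{n}).$$

Again we have

$$C_2(\epsilon;\mathbf{n}) = \frac{1}{\log 2 - H(\epsilon)} \Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) \log\Bigl(\sum_{i=0}^{l} n_i n_{i+1}\Bigr) [1 + o(1)].$$

For any fixed $\lambda > 0$ if $m \ge (1+\lambda) C_2(\epsilon;\mathbf{n})$ then $P\{\mathcal{E}_\epsilon(\mathbf{n},m)\} \to 0$ as $n \to \infty$. Hence we have the following

Theorem 5.2 The sequence $C_2(\epsilon;\mathbf{n})$ is an upper capacity function for the attribute $\mathcal{E}_\epsilon(\mathbf{n},m)$.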


The results can be sharpened, for instance, by using the tighter bound $D(n_i,m) \le 2m^{n_i-1}/(n_i-1)!$, which is valid if $m \ge 3n_i$ and $n_i \ge 3$, instead of the simpler estimate $D(n_i,m) \le m^{n_i}$ used here. Unfortunately, good lower bounds are currently unavailable except for the case of one and two layer networks (cf. Venkatesh and Psaltis [6]; Baum [16]).
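To get a feel for the size of these upper capacity functions, the following sketch solves the defining relation numerically for a given architecture. It rests on one assumption that should be stated plainly: the display equations in this section are damaged in the source, and the relation used here, $C\,[\log 2 - H(\epsilon)] = W \log C$ with $W = \sum_i n_i n_{i+1}$, is the one suggested by the bound $\exp\{-m\log 2 + W\log m + mH(\epsilon)\}$. Setting $\epsilon = 0$ gives $C_1$, and $\epsilon > 0$ gives $C_2$. All function names are illustrative.

```python
from math import log


def binary_entropy(eps: float) -> float:
    """H(eps) in nats, with H(0) = H(1) = 0 by convention."""
    if eps <= 0.0 or eps >= 1.0:
        return 0.0
    return -eps * log(eps) - (1.0 - eps) * log(1.0 - eps)


def weight_count(layers) -> int:
    """W = sum_i n_i * n_{i+1} for layers = (n_0, n_1, ..., n_{l+1})."""
    return sum(a * b for a, b in zip(layers, layers[1:]))


def capacity_bound(layers, eps: float = 0.0, tol: float = 1e-12) -> float:
    """Larger root C of C * (log 2 - H(eps)) = W * log C, by bisection."""
    W = weight_count(layers)
    rate = log(2.0) - binary_entropy(eps)
    if rate <= 0.0 or W < 2:
        raise ValueError("need 0 <= eps < 1/2 and at least two weights")

    def f(C: float) -> float:
        return C * rate - W * log(C)

    lo = W / rate            # minimiser of f; f(lo) < 0 whenever W >= 2
    hi = 2.0 * lo
    while f(hi) <= 0.0:      # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    layers = (10, 5, 1)      # hypothetical network: 10 inputs, 5 hidden, 1 output
    print(weight_count(layers))
    print(capacity_bound(layers))            # C_1
    print(capacity_bound(layers, eps=0.1))   # C_2 at eps = 0.1
```

The larger root is taken because the relation also has a trivial root near C = 1 that carries no information about capacity.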
6 CONCLUSIONS

The main contributions of this paper are the development of formal protocols for error tolerance, and the explicit computation of the gains that accrue for neural network capacity under these protocols when errors are permitted in the output. The principal result here is that error tolerance in a network situation results in gains in the multiplicative constants for the capacity, but leaves the rate of growth of capacity as input dimensionality increases unchanged. In particular, if a tolerance of $\epsilon \in [0, 1/2)$ is specified for a fully-interconnected network, then under the random error protocol there is a gain in capacity by a multiplicative factor of $1/(1-2\epsilon)$ over the error free case, while for the exhaustive error protocol the gain is no more than a multiplicative factor of $505/(1-2\epsilon)$. Similar gains accrue for feedforward network configurations. The
absence  of  more  startling  gains  in  capacity  can  be  traced  to  the  exponential  decay  of  the  relevant  probabilities. 
These  protocols  are  readily  applicable  in  other  computational  paradigms.  Following  the  analysis  here,  in  a 
general  computational  scenario  we  would  expect  error  tolerance  to  buy  us  order  of  magnitude  improvements 
in  computational  capability  only  if  the  relevant  probabilities  decay  sufficiently  slowly. 
Figure Captions

Fig.  1  A  schematic  demonstrating  error  correction  (attraction)  and  error  tolerance  for  a  memory.  Points  in 
the  Hamming  ball  of  radius  pn  at  the  memory  lie  on  trajectories  which  ultimately  are  confined  in  the 
(smaller)  Hamming  ball  of  radius  rn  at  the  memory. 

Fig. 2 A schematic showing a set of “admissible” transitions starting from a memory. Each such transition from a memory must result in a new vertex of the cube $\{-1, 1\}^n$ which differs from the memory in no more than $\epsilon n$ components.




Despite recent progress, our present understanding of these concepts in the context of neural networks is obstructed by complexities in the functional form of the network and in the classification problems themselves.

In this correspondence we will present analytic results on these issues for the nearest-neighbor classifier. Noted for its algorithmic simplicity and nearly optimal performance in the infinite sample limit, this pattern classifier plays a central role in the field of pattern recognition. Furthermore, because it uses proximity in feature space as a measure of class similarity, its performance on a given classification problem should yield qualitative cues to the performance of a neural network. Indeed, a nearest-neighbor classifier can be readily implemented as a “winner-take-all” neural network.

2  THE  TASK  OF  PATTERN  CLASSIFICATION 

We begin with a formulation of the two-class problem (Duda and Hart, 1973):

Let the labels $\omega_1$ and $\omega_2$ denote two states of nature, or pattern classes. A pattern belonging to one of these two classes is selected, and a vector of n features, $\mathbf{x}$, that describe the selected pattern is presented to a pattern classifier. The classifier then attempts to guess the selected pattern's class by assigning $\mathbf{x}$ to either $\omega_1$ or $\omega_2$.

As an example, the two class labels might represent the states benign and malignant as they pertain to the diagnosis of cancer tumors; the feature vector could then be a 1024 × 1024 pixel, real-valued representation of an electron-microscope image. A pattern classifier can thus be viewed as a mapping from an n-dimensional feature space to the discrete set $\{\omega_1, \omega_2\}$, and can be specified by demarcating the regions in the n-dimensional feature space that correspond to $\omega_1$ and $\omega_2$. We define the decision region $\mathcal{R}_1$ as the set of feature vectors that the pattern classifier assigns to $\omega_1$, with an analogous definition for $\mathcal{R}_2$. A useful figure of merit is the probability that the feature vector of a randomly selected pattern is assigned to the correct class.


2.1  THE  BAYES  CLASSIFIER 


If sufficient information is available, it is possible to construct an optimal pattern classifier. Let $P(\omega_1)$ and $P(\omega_2)$ denote the prior probabilities of the two states of nature. (For our cancer diagnosis problem, the prior probabilities can be estimated by the relative frequency of each type of tumor in a large statistical sample.) Further, let $p(\mathbf{x} \mid \omega_1)$ and $p(\mathbf{x} \mid \omega_2)$ denote the class-conditional probability densities of the feature vector for the two class problem. The total probability density is now defined by $p(\mathbf{x}) = p(\mathbf{x} \mid \omega_1)P(\omega_1) + p(\mathbf{x} \mid \omega_2)P(\omega_2)$, and gives the unconditional distribution of the feature vector. Where $p(\mathbf{x}) \ne 0$ we can now use Bayes' rule to compute the posterior probabilities:

$$P(\omega_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_1)P(\omega_1)}{p(\mathbf{x})} \quad \text{and} \quad P(\omega_2 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_2)P(\omega_2)}{p(\mathbf{x})}.$$

The Bayes classifier assigns an unclassified feature vector $\mathbf{x}$ to the class label having the greatest posterior probability. (If the posterior probabilities happen to be equal, then the class assignment is arbitrary.) With $\mathcal{R}_1$ and $\mathcal{R}_2$ denoting the two decision regions induced by this strategy, the probability of error of the Bayes classifier, $P_B$, is just the probability that $\mathbf{x}$ is drawn from class $\omega_1$ but lies in the Bayes decision region $\mathcal{R}_2$, or conversely, that $\mathbf{x}$ is drawn from class $\omega_2$ but lies in the Bayes decision region $\mathcal{R}_1$:

$$P_B = \int_{\mathcal{R}_2} P(\omega_1 \mid \mathbf{x})\,p(\mathbf{x})\,d^n x + \int_{\mathcal{R}_1} P(\omega_2 \mid \mathbf{x})\,p(\mathbf{x})\,d^n x.$$

The reader may verify that the Bayes classifier minimizes the probability of error.
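A small worked example may make the Bayes rule concrete. The sketch below uses a hypothetical one-dimensional problem with two normal class-conditional densities (means -1 and +1, unit variance, equal priors); these values and all function names are assumptions chosen for illustration. It computes the posteriors by Bayes' rule, assigns the class with the larger posterior, and integrates the smaller joint density to obtain $P_B$.

```python
from math import erfc, exp, pi, sqrt


def gaussian_pdf(x, mean, sigma):
    """Normal density with the given mean and standard deviation."""
    return exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))


def posteriors(x, means=(-1.0, 1.0), sigma=1.0, priors=(0.5, 0.5)):
    """Return (P(w1|x), P(w2|x)) by Bayes' rule; requires p(x) != 0."""
    joint = [gaussian_pdf(x, mu, sigma) * pr for mu, pr in zip(means, priors)]
    total = sum(joint)                      # total density p(x)
    return joint[0] / total, joint[1] / total


def bayes_decision(x, **kwargs):
    """Class index (1 or 2) with the larger posterior; ties go to class 1."""
    p1, p2 = posteriors(x, **kwargs)
    return 1 if p1 >= p2 else 2


def bayes_error(means=(-1.0, 1.0), sigma=1.0, priors=(0.5, 0.5),
                lo=-12.0, hi=12.0, steps=24000):
    """P_B as the integral of min_i p(x|w_i)P(w_i), by the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += min(gaussian_pdf(x, mu, sigma) * pr
                     for mu, pr in zip(means, priors)) * h
    return total


if __name__ == "__main__":
    print(posteriors(0.5))
    print(bayes_decision(0.5))
    print(bayes_error(), 0.5 * erfc(1.0 / sqrt(2.0)))
```

On each decision region the misclassified mass is the smaller of the two joint densities, which is why integrating the pointwise minimum reproduces the two-integral expression for $P_B$.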

Unfortunately,  it  is  usually  impossible  to  obtain  expressions  for  the  class-conditional 
densities  and  prior  probabilities  in  practice.  Typically,  the  available  information 
resides  in  a  set  of  correctly  labeled  patterns,  which  we  collectively  term  a  training 
or  reference  sample.  Over  the  last  few  decades,  numerous  pattern  classification 
strategies  have  been  developed  that  attempt  to  learn  the  structure  of  a  classification 
problem  from  a  finite  training  sample.  (The  backpropagation  algorithm  is  a  recent 
example.)  The  underlying  hope  is  that  the  classifier’s  performance  can  be  made 
acceptable  with  a  sufficiently  large  reference  sample.  In  order  to  understand  how 
large  a  sample  may  be  needed,  we  turn  to  what  is  perhaps  the  simplest  learning 
algorithm  of  this  class. 

3  THE  NEAREST-NEIGHBOR  CLASSIFIER 

Let $X_M = \{(\mathbf{x}^{(1)}, \theta^{(1)}), (\mathbf{x}^{(2)}, \theta^{(2)}), \ldots, (\mathbf{x}^{(M)}, \theta^{(M)})\}$ denote a finite reference sample of M feature vectors, $\mathbf{x}^{(i)} \in \mathbb{R}^n$, with corresponding known class assignments, $\theta^{(i)} \in \{\omega_1, \omega_2\}$. The nearest-neighbor rule assigns each feature vector $\mathbf{x}$ to class $\omega_1$ or $\omega_2$ as a function of the reference M-sample as follows:

• Identify $(\mathbf{x}', \theta') \in X_M$ such that $\|\mathbf{x} - \mathbf{x}'\| \le \|\mathbf{x} - \mathbf{x}^{(i)}\|$ for i ranging from 1 through M;

• Assign $\mathbf{x}$ to class $\theta'$.

Here, $\|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}$ denotes the Euclidean metric in $\mathbb{R}^n$.¹ The nearest-neighbor rule hence classifies each feature vector $\mathbf{x}$ according to the label, $\theta'$, of the closest point, $\mathbf{x}'$, in the reference sample. As an example, we sketch the nearest-neighbor decision regions for a two-dimensional classification problem in Fig. 1.

Figure 1: The decision regions induced by a nearest-neighbor classifier with a seven-element reference set in the plane.

¹Other metrics, such as the more general Minkowski-r metric, are also possible.
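The two-step rule above translates almost directly into code. The following is a minimal sketch, assuming the reference sample is held as a list of (feature vector, label) pairs and that ties are broken in favour of the earliest entry; both choices, and the function name, are illustrative rather than taken from the paper.

```python
from math import dist


def nearest_neighbor_label(x, reference):
    """Return the label of the reference point closest to x.

    reference is a sequence of (feature_vector, label) pairs. Distances use
    the Euclidean metric; on ties the earliest entry wins, which is one
    arbitrary but deterministic choice.
    """
    if not reference:
        raise ValueError("reference sample must be non-empty")
    _, label = min(reference, key=lambda pair: dist(x, pair[0]))
    return label


if __name__ == "__main__":
    # Hypothetical seven-element reference set in the plane.
    reference = [
        ((0.0, 0.0), 1), ((1.0, 0.0), 1), ((0.0, 1.0), 1),
        ((3.0, 3.0), 2), ((4.0, 3.0), 2), ((3.0, 4.0), 2), ((4.0, 4.0), 2),
    ]
    for x in [(0.2, 0.1), (1.4, 1.4), (3.5, 3.6)]:
        print(x, nearest_neighbor_label(x, reference))
```

The linear scan costs one distance evaluation per reference point for every query, which is adequate for small M; spatial index structures are the usual remedy when M is large.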


It is interesting to consider how the performance of this classifier compares with that of a Bayes classifier. To facilitate this analysis, we assume that the reference patterns are selected from the total probability density $p(\mathbf{x})$ in a statistically independent manner (i.e., the choice of $\mathbf{x}^{(j)}$ does not in any way bias the selection of $\mathbf{x}^{(j+1)}$ and $\theta^{(j+1)}$). Furthermore, let $P_M(\text{error})$ denote the probability of error of a nearest-neighbor classifier working with the reference sample $X_M$, and let $P_\infty(\text{error})$ denote this probability in the infinite sample limit ($M \to \infty$). We will also let S denote the volume in feature space over which $p(\mathbf{x})$ is nonzero. The following well known theorem shows that the nearest-neighbor classifier, with an infinite reference sample, is nearly optimal (Cover and Hart, 1967).²

Theorem 1 For the two-class problem in the infinite sample limit, the probability of error of a nearest-neighbor classifier tends toward the value,

$$P_\infty(\text{error}) = 2 \int_S P(\omega_1 \mid \mathbf{x})\,P(\omega_2 \mid \mathbf{x})\,p(\mathbf{x})\,d^n x,$$

which is furthermore bounded by the two inequalities,

$$P_B \le P_\infty(\text{error}) \le 2P_B(1 - P_B),$$

where $P_B$ is the probability of error of a Bayes classifier.

This encouraging result is not so surprising if one considers that, with probability one, about every feature vector $\mathbf{x}$ is centered a ball of radius $\epsilon$ that contains an infinite number of reference feature vectors for every $\epsilon > 0$. The annoying factor of two accounts for the event that the nearest neighbor to $\mathbf{x}$ belongs to the class with smaller posterior probability.

3.1  THE  ASYMPTOTIC  CONVERGENCE  RATE 

In order to satisfactorily address the issues of generalization and scalability for the nearest-neighbor classifier, we need to consider the rate at which the performance of the classifier approaches its infinite sample limit. The following theorem applicable to nearest-neighbor classification in one-dimensional feature spaces was shown by Cover (1968).

Theorem 2 Let $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ have uniformly bounded third derivatives and let $p(x)$ be bounded away from zero on S. Then for sufficiently large M,

$$P_M(\text{error}) = P_\infty(\text{error}) + O(1/M^2).$$

Note that this result is also very encouraging in that an order of magnitude increase in the sample size decreases the error rate by two orders of magnitude.

The following theorem is our main result, which extends Cover's theorem to n-dimensional feature spaces:

²Originally, this theorem was stated for multiclass decision problems; it is here presented for the two class problem only for simplicity.

Theorem 3 Let $p(\mathbf{x} \mid \omega_1)$, $p(\mathbf{x} \mid \omega_2)$, and $p(\mathbf{x})$ satisfy the same conditions as in Theorem 2. Then, there exists a scalar a (depending on n) such that

$$P_M(\text{error}) \sim P_\infty(\text{error}) + \frac{a}{M^{2/n}},$$

where the right-hand side describes the first two terms of an asymptotic expansion in reciprocal powers of $M^{2/n}$. Explicitly,


_  r(i  +  l)  (r(f +  1))' 


where, 


7.-.(x)  = 


.  5P(w2  I  x)  I  x) 

P(u,r  I  X)— - +  —^^3 - P(u,2  I  X). 


For n = 1 this result agrees with Cover's theorem. With increasing n, however,
the convergence rate slows down significantly. Note that the constant $a$ depends on
the way in which the class-conditional densities overlap. If $a$ is bounded away from
zero, then for sufficiently small $\delta > 0$, $P_M(\mathrm{error}) - P_\infty(\mathrm{error}) < \delta$ is satisfied only
if $M > (a/\delta)^{n/2}$, so that the sample size required to achieve a given performance
criterion is exponential in the dimensionality of the feature space. The above
provides a sufficient condition for Bellman's well known "curse of dimensionality" in
this context.
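
To make the exponential growth concrete, the following is a minimal sketch that evaluates the bound M > (a/δ)^(n/2) for a few dimensions. The function name `min_sample_size` and the numerical values of a and δ are illustrative assumptions; in particular, a is held fixed here although in Theorem 3 it depends on n.

```python
import math


def min_sample_size(a, delta, n):
    """Smallest integer M satisfying M > (a / delta) ** (n / 2).

    a     : leading coefficient of the excess-error expansion (assumed given)
    delta : target excess error P_M(error) - P_inf(error)
    n     : dimensionality of the feature space
    """
    return math.floor((a / delta) ** (n / 2)) + 1


if __name__ == "__main__":
    # Illustrative values only: a = 1 and delta = 0.01 are not taken from the text.
    for n in range(1, 6):
        print(n, min_sample_size(1.0, 0.01, n))
```

With a/δ held fixed, each additional pair of dimensions multiplies the required sample size by a/δ.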

It is also interesting to note that one can easily construct classification problems for
which $a$ vanishes. (Consider, for example, $p(x \mid \omega_1) = p(x \mid \omega_2)$ for all x.) In these
cases the higher-order terms in the asymptotic expansion are important.


4  A  NUMERICAL  EXPERIMENT 

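A conspicuous weakness in the above theorem is the requirement that p(x) be
bounded away from zero over S. In exchange for a uniformly convergent asymptotic
expansion, we have omitted many important probability distributions, including
normal distributions. Therefore we numerically estimate the asymptotic behavior
of $P_M(\mathrm{error})$ for a problem consisting of two normally distributed classes in $\mathbb{R}^n$:

\[ p(x \mid \omega_i) = (2\pi\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} (x - \mu_i) \cdot (x - \mu_i) \right]. \]

Assuming that $P(\omega_1) = P(\omega_2) = 1/2$, we find

\[ P_\infty(\mathrm{error}) = \frac{e^{-\mu^2/2\sigma^2}}{\sigma\sqrt{2\pi}} \int_0^\infty \operatorname{sech}\left(\frac{\mu x}{\sigma^2}\right) e^{-x^2/2\sigma^2} \, dx, \]

where $\mu$ denotes half the distance between the two class means.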


Figure 2: Numerical validation of the nearest-neighbor scaling hypothesis for two
normally distributed classes in $\mathbb{R}^n$. (Horizontal axis: $\log_{10}(M)$.)

For $\mu = \sigma = 1$, $P_\infty(\mathrm{error})$ is numerically found to be 0.22480, which is consistent
with the Bayes probability of error, $P_B = (1/2)\,\mathrm{erfc}(1/\sqrt{2}) = 0.15865$, in that it lies
between $P_B$ and $2P_B(1 - P_B)$. (Note that
the expression for $a$ given in Theorem 3 is undefined for these distributions.) For
n ranging from 1 to 5, and M ranging from 1 to 200, three estimates of $P_M(\mathrm{error})$
were obtained, each as the fraction of "failures" in 160,000 or more Bernoulli trials.
Each trial consists of constructing a pseudo-random sample of M reference patterns,
followed by a single attempt to correctly classify a random input pattern. These
estimates of $P_M$ are represented in Figure 2 by circular markers for n = 1, crosses
for n = 2, etc. The lines in Figure 2 depict the power law

\[ P_M(\mathrm{error}) = P_\infty(\mathrm{error}) + b M^{-2/n}, \]

where, for each n, $b$ is chosen to obtain an appealing fit. The agreement between
these lines and data points suggests that the asymptotic scaling hypothesis of
Theorem 3 can be extended to a wider class of distributions.
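
The two reference values quoted for unit mean offset and unit variance can be reproduced with a few lines of numerical quadrature. The sketch below is only a check under stated assumptions: it takes the class means to sit at +μ and −μ along one axis, uses the integrand sech(μx/σ²) exp(−x²/2σ²) with prefactor exp(−μ²/2σ²)/(σ√(2π)), and truncates the integral at 12σ. The function names are hypothetical.

```python
import math


def nn_asymptotic_error(mu=1.0, sigma=1.0, upper=12.0, steps=4000):
    """Composite Simpson estimate of the infinite-sample nearest-neighbor error.

    Assumes equal priors and class means at +mu and -mu; `steps` must be even.
    The integral over [0, inf) is truncated at upper * sigma.
    """
    h = upper * sigma / steps

    def g(x):
        return math.exp(-x * x / (2 * sigma ** 2)) / math.cosh(mu * x / sigma ** 2)

    total = g(0.0) + g(upper * sigma)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    integral = total * h / 3
    prefactor = math.exp(-mu ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return prefactor * integral


def bayes_error(mu=1.0, sigma=1.0):
    """Bayes error for the same two-class problem: (1/2) erfc(mu / (sigma sqrt 2))."""
    return 0.5 * math.erfc(mu / (sigma * math.sqrt(2)))


if __name__ == "__main__":
    print(round(nn_asymptotic_error(), 5), round(bayes_error(), 5))
```

The nearest-neighbor limit should fall between the Bayes error and twice the Bayes error, in line with the factor of two discussed earlier.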


5  DISCUSSION 


The preceding analysis indicates that the convergence rate of the nearest-neighbor
classifier slows down dramatically as the dimensionality of the feature space increases.
This rate reduction suggests that proximity in feature space is a less effective
measure of class identity in higher dimensional feature spaces. It is also clear
that some degree of smoothness in the class-conditional densities is necessary, as
well as sufficient, for the asymptotic behavior described by our analysis to occur:
in the absence of smoothness conditions, one can construct classification problems
for which the nearest-neighbor convergence rate is arbitrarily slow, even in one
dimension (Cover, 1968). Fortunately, the most pressing classification problems are
typically smooth in that they are constrained by regularities implicit in the laws of
nature (Marr, 1982). With additional prior information, the convergence rate may
be enhanced by selecting a smaller number of descriptive features.

Because of their smooth input-output response, neural networks appear to use
proximity in feature space as a basis for classification. One might, therefore, expect the
required sample size to scale exponentially with the dimensionality of the feature
space. Recent results from computational learning theory, however, imply that with
a sample size proportional to the capacity (a combinatorial quantity which is
characteristic of the network architecture and which typically grows polynomially in the
dimensionality of the feature space) one can in principle identify network
parameters (weights) which give (close to) the smallest classification error for the given
architecture (Baum and Haussler, 1989). There are two caveats, however. First,
the information-theoretic sample complexities predicted by learning theory give no
clue as to whether, given a sample of the requisite size, there exist any algorithms
that can specify the appropriate parameters in a reasonable time frame. Second,
and more fundamental, one cannot in general determine whether a particular
architecture is intrinsically well suited to a given classification problem. The best
performance achievable may be substantially poorer than that of a Bayes classifier.
Thus, without sufficient prior information, one must search through the space of
all possible network architectures for one that does fit the problem well. This
situation now effectively resembles a non-parametric classifier and the analytic results
for the sample complexities of the nearest-neighbor classifier should provide at least
qualitative indications of the corresponding case for neural networks.
Abstract

The following coin tossing game is analysed: A store of N fair coins is given and it is desired
to achieve M heads in a round of tosses of the coins. To allow for unfavourable sequences of tails,
restarts are permitted at any epoch in the game where, in any restart, all coins are returned
to store and tosses are begun anew from tabula rasa. A restart strategy is a prescription
which specifies when a restart should be made. It is desired to estimate the minimum expected
duration of the game over all restart strategies, and to find an optimal strategy which minimises
the expected duration of the game. This simple coin tossing game, proposed by R. L. Rivest,
has cryptographic roots and is linked to issues in the factoring of integers.

Index Terms: Backward Induction, Coin Tossing Game, Cryptography, Integer Factoring,
Markov Decision Problem, Optimal Stopping


1  TWENTY  QUESTIONS 

R. L. Rivest proposed the following problem. An individual has 20 fair coins in his pocket. He takes
coins out of his pocket one at a time and tosses them, his objective being to obtain 15 heads. If
fewer than 15 heads transpire in any round of 20 tosses, he must return all 20 coins to his pocket
and restart the game. He also has the option of restarting the game at any point by ending a round
of tosses and returning all 20 coins to his pocket before starting anew. The problem facing our
protagonist is to choose an optimal restart strategy which would minimise the expected number of
tosses he has to make before achieving his goal of 15 heads.

The general problem where M heads are desired in tosses out of a store of N fair coins, with
restarts allowed, has cryptographic roots and, in particular, is related to the problem of choosing an
optimal early abort strategy in randomised algorithms for factoring integers.

Consider for instance the problem of factoring an integer n. A basic approach to finding a factor
of n is to first find integers k and l such that

\[ k^2 \equiv l^2 \pmod{n}, \qquad 0 < k, l < n, \quad k \ne l, \quad k + l \ne n. \tag{1} \]

In fact, this congruence implies that n is a divisor of $(k^2 - l^2)$ yet n divides neither $(k - l)$ nor
$(k + l)$. It follows that $\gcd(n, k - l)$ and $\gcd(n, k + l)$ are proper factors of n, and these can be found
efficiently, for instance, by Euclid's algorithm (Euclid [1, Book VII, Propositions 1, 2]!).
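
As a small worked instance of this step, the sketch below recovers two proper factors from a pair (k, l) satisfying the congruence and side conditions of (1). The helper name `split_from_congruence` and the toy modulus 91 are illustrative choices, not taken from the text.

```python
from math import gcd


def split_from_congruence(n, k, l):
    """Return (gcd(n, |k - l|), gcd(n, k + l)) for k, l satisfying condition (1).

    Raises ValueError if the side conditions or the congruence k^2 = l^2 (mod n) fail.
    """
    if not (0 < k < n and 0 < l < n and k != l and k + l != n):
        raise ValueError("side conditions of (1) are violated")
    if (k * k - l * l) % n != 0:
        raise ValueError("k^2 and l^2 are not congruent modulo n")
    return gcd(n, abs(k - l)), gcd(n, k + l)


if __name__ == "__main__":
    # 10^2 - 3^2 = 91, so k = 10, l = 3 satisfies (1) for n = 91 = 7 * 13.
    print(split_from_congruence(91, 10, 3))
```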
Now let a number m < n be fixed and let $\pi(m) = N$ denote the number of primes less than
m. Denote these primes by $p_1, \ldots, p_N$. Now note that it suffices to derive N + 1 distinct solutions
$(k_i; h_{i1}, \ldots, h_{iN})$, $1 \le i \le N + 1$, to the congruence

\[ k_i^2 \equiv p_1^{h_{i1}} p_2^{h_{i2}} \cdots p_N^{h_{iN}} \pmod{n}, \tag{2} \]

for then the vectors $(h_{i1}, \ldots, h_{iN})$, $1 \le i \le N + 1$, are linearly dependent modulo 2, i.e., there exists
$(h'_1, \ldots, h'_N)$ such that (taking the dependent subset, for notational simplicity, to be all N + 1 vectors)

\[ (h_{11}, \ldots, h_{1N}) + \cdots + (h_{N+1,1}, \ldots, h_{N+1,N}) = (2h'_1, \ldots, 2h'_N). \]

The integers

\[ k = (k_1 \cdots k_{N+1}) \bmod n, \qquad l = (p_1^{h'_1} \cdots p_N^{h'_N}) \bmod n \]

would hence be a solution to the congruence (1).¹

Dixon's algorithm is a randomised approach to finding solutions to the congruence (2). The
algorithm proceeds in a series of rounds prior to each of which a random integer k is generated as
a putative solution to (2). During a round each of the N primes $p_i$ is successively tested to see
whether $(k^2 \bmod n)$ has all its prime factors less than m. If all the N primes have been tested and
a solution to (2) has not been obtained then a new round is initiated with the generation of a new
random integer k. Provision is also made for an early abort strategy where a round is ended and a
new random integer k generated before all the N primes have been tested. (See Pomerance [2] for
more details.)

Checking to see whether a given prime p less than m divides the random integer k during a
round corresponds to a coin toss in our problem, with the result a "head" if p divides $(k^2 \bmod n)$;
the total number of available primes N determines the number of coins available. Success in the
coin tossing game corresponds to obtaining a solution to the congruence (2); the number of "heads"
needed in a round of "tosses" is the number $M(k) \le N$ of primes $p_i$ for which $h_i \ne 0$ in (2) if
in fact the congruence can be satisfied for the given value of k. Points of departure from the coin
tossing problem are that the probability that any given prime p in our store divides $(k^2 \bmod n)$ is
1/p (neglecting boundary effects), and this varies from prime to prime. Furthermore, the number
M(k) of "heads" needed in a round varies from round to round as the values of k are randomly
generated. Roughly speaking, however, most of the primes p are of the order of m and most values
of $(k^2 \bmod n)$ are of the order of n so that an approximation to the problem in terms of the coin
tossing game is to consider a store of $N = \pi(m)$ unfair coins with identical probabilities 1/m of a
toss resulting in a head and require to find an optimal strategy which minimises the total number
of tosses before achieving $M = \log_m n$ heads (see Section 6).

In recent unpublished work, G. F. Bachelis and F. J. Massey [3] have attempted to characterise
optimal strategies for the coin tossing problem using elegant techniques from Markov decision theory.
Formulating the game as a Markov decision problem, they are led to a consideration of a related
optimal stopping problem for a random walk to obtain some general properties of an optimal strategy.
They also link the coin tossing problem to a continuous analogue involving an optimal stopping
problem for a particle moving under Brownian motion. While explicit closed form solutions for
an optimal strategy and the expected minimum duration of the game remain elusive, they obtain
asymptotic results on the expected duration of the game for a choice $M = N/2 + O(\sqrt{N})$ $(N \to \infty)$
for two suboptimal strategies: restarting only when the number of tails in a round reaches N − M + 1,
and restarting when the number of tails in a round exceeds the number of heads by a fixed amount.

Our approach to the problem here, in sharp contrast, uses purely elementary techniques. Our
main results, contained largely in the next three sections, involve sharp estimates of the minimum
expected duration of the game, and a specification of an efficient procedure for finding optimal
strategies.


¹Caveat: unless k = ±l.
In brief, Section 2 is devoted to a formalisation of the problem, and a characterisation of some
principal  features  of  an  optimal  restart  strategy.  The  main  result  shown  here  is  that  the  search  for 
an  optimal  strategy  can  be  confined  to  a  relatively  small  family  of  restart  strategies  that  we  call 
parsimonious. 

Section 3 contains explicit estimates of the expected duration of the game. In particular, using
elementary arguments we obtain an explicit general form for the expected duration of the game
under any consistent strategy. We also illustrate the utility of the general solution with calculations
for various strategies. With direct estimates of the minimum expected duration of the game we then
show that when M < N/2 an optimal strategy (and essentially any sensible strategy) has expected
duration 2M[1 + o(1)], when M = N/2 an optimal strategy has expected duration between 2M and
4M, while for M > N/2 an optimal strategy has expected duration increasing exponentially in M.

In  Section  4  we  explicitly  prescribe  a  backward  induction  algorithm  which  efficiently  generates 
optimal  strategies  for  the  coin  tossing  problem,  and  provide  a  proof  of  its  convergence  to  an  optimal 
solution.  The  algorithm  exploits  the  equivalence  between  the  coin  tossing  game  and  a  related  optimal 
stopping problem using the prior characterisation of the features of an optimal strategy (derived in
Section  2)  and  the  general  solutions  for  the  expected  duration  of  the  game  (obtained  in  Section  3). 

In  Section  5  we  include  some  numerical  computations  comparing  the  expected  duration  of  the 
game  under  various  strategies  with  the  expected  duration  of  the  game  under  an  optimal  strategy 
obtained  using  the  backward  induction  algorithm.  The  simulations  indicate  that  a  restart  strategy 
introduced here, the Ballot Strategy, has a near-optimal character.

We  conclude  in  Section  6  with  extensions  of  our  results  to  coin  tossing  games  where  the  coin  is 
unfair. 

2  THE  COIN  TOSSING  GAME 

2.1  Restart  Strategies 

Let  us  begin  by  formalising  the  setup  of  the  game.  The  following  three  items  constitute  the  game's 
critical  parameters: 

•  A sequence $\{X_T, T \ge 1\}$ of symmetric Bernoulli trials taking values in {0, 1} and denoting the results
of a sequence of fair coin tosses. A coin toss resulting in a "head" corresponds to $X_T = 1$ and
is called a "success."

•  A running total of successes S initialised with S ← 0, and the number n of coin tosses in the
current round also initialised with the assignment n ← 0.

•  A sequence $(f_j, j \ge 1) = (f_1, f_2, \ldots)$ called the restart strategy where each $f_j$, a restart
function, is a randomised Boolean function from {0, 1, ..., N} × {0, 1, ..., M − 1} into {0, 1} with

\[ f_j(n, m) = \begin{cases} 1 & \text{with probability } \phi_j(n, m), \\ 0 & \text{with probability } 1 - \phi_j(n, m). \end{cases} \]

Here $\phi_j(n, m)$ denotes the probability of deciding to restart (in round j) when m successes
have been obtained in n ≤ N tosses; we also impose the constraint $\phi_j(N, m) = 1$ for m = 0,
..., M − 1, i.e., a restart is mandated if all N coins are tossed and fewer than M successes
obtained.²

The game proceeds iteratively as follows, starting at epoch T = 1 and initialised with the number of
successes in a round S = 0, the number of coin tosses in a round n = 0, and the number of restarts
K = 0:

²An intuitive approach would be to just consider stationary, nonrandom restart strategies specified by $f_j = f$,
j ≥ 1, where f denotes any fixed, deterministic function. It is nontrivial to determine whether or not the increased
generality espoused in the setup here buys improvements. See Lemma 2.1 for a resolution.
1. [Accumulation.] Update the running total of successes and the number of coin tosses in the
current round of tosses: $S \leftarrow S + X_T$, $n \leftarrow n + 1$.

2. [Stop?] Check to see if the number of successes is equal to the desired value: If S = M output
the duration of the game T and stop.

3. [Restart?] Check the restart strategy to see if the game should be restarted: If $f_{K+1}(n, S) = 1$
then reset S ← 0 and n ← 0, and increment K ← K + 1.

4. [Next epoch.] Increment the epoch by one: T ← T + 1. Go back to step 1.
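
The four steps above can be sketched directly as a simulation. This is a minimal illustration rather than part of the analysis: the function name `play_game`, the callable `restart(j, n, m)` standing in for the restart function of round j, and the explicit forced restart at n = N are assumptions made for the example.

```python
import random


def play_game(N, M, restart, rng):
    """Simulate one game and return its duration T (total number of tosses).

    N, M    : number of coins in the store and number of heads required (M <= N)
    restart : callable (j, n, m) -> bool, the restart decision in round j after
              n tosses and m successes; a restart is forced when n == N
    rng     : a random.Random instance supplying fair coin tosses
    """
    if not 1 <= M <= N:
        raise ValueError("need 1 <= M <= N")
    T = 0  # epoch counter: total tosses made so far
    S = 0  # successes in the current round
    n = 0  # tosses in the current round
    K = 0  # restarts so far
    while True:
        T += 1
        S += rng.randint(0, 1)              # step 1: accumulate the toss
        n += 1
        if S == M:                          # step 2: stop on success
            return T
        if n == N or restart(K + 1, n, S):  # step 3: restart check
            S, n, K = 0, 0, K + 1
        # step 4: next epoch (loop continues)


if __name__ == "__main__":
    rng = random.Random(0)
    never = lambda j, n, m: False  # restart only when forced at n == N
    sample = [play_game(20, 15, never, rng) for _ in range(2000)]
    print(sum(sample) / len(sample))
```

Averaging the returned durations over many runs gives a Monte Carlo estimate of the expected duration under the chosen strategy.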

Let $\mathcal{S}$ denote the family of all (random) restart strategies. For a given strategy $(f_j, j \ge 1)$, let
$T_{(f_j, j \ge 1)}$ denote the duration of the game. We say that a strategy $(f_j^*, j \ge 1) \in \mathcal{S}$ is optimal if
$E\,T_{(f_j^*, j \ge 1)} = \inf_{\mathcal{S}} E\,T_{(f_j, j \ge 1)}$. Our goal is to find an optimal strategy and to estimate the minimum
expected duration of the game

\[ D(N, M) = \inf_{\mathcal{S}} E\,T_{(f_j, j \ge 1)}. \tag{3} \]

The  model  described  here  does  not  include  a  cost  for  restarting  the  game.  Restarting  costs  can, 
however,  be  incorporated  very  simply;  see  the  remark  following  the  proof  of  Theorem  3.3. 

2.2  Optimal  Strategies 

Consider a single round of coin tosses governed by a restart function f, i.e., the round of tosses
ends either when M successes are obtained or when a restart condition f(n, m) = 1 is encountered.
Denote by $\alpha_f$ the probability that M heads obtains, $\beta_f = 1 - \alpha_f$ the probability that a restart
condition is encountered, and $\tau_f$ the number of coins tossed before the round ends. Now consider
a strategy $(f_j, j \ge 1)$. Let the random variable K denote the number of restarts before the game
finally terminates with M heads. It is clear that K has the distribution

\[ P\{K = k\} = \alpha_{f_{k+1}} \prod_{j=1}^{k} \beta_{f_j}, \qquad k = 0, 1, \ldots. \tag{4} \]

Now conditioned on the event {K ≥ 1} we have $T_{(f_j, j \ge 1)} = \tau_{f_1} + T_{(f_j, j \ge 2)}$. Further, the events
{K = 0} and {K ≥ 1} depend solely on $f_1$, with $\alpha_{f_1} = P\{K = 0\}$ and $\beta_{f_1} = P\{K \ge 1\}$. By a
simple conditioning argument we then have

\begin{align}
E\,T_{(f_j, j \ge 1)} &= E[T_{(f_j, j \ge 1)} \mid K = 0]\, P\{K = 0\} + E[T_{(f_j, j \ge 1)} \mid K \ge 1]\, P\{K \ge 1\} \notag \\
&= E(\tau_{f_1} \mid K = 0)\, P\{K = 0\} + [E(\tau_{f_1} \mid K \ge 1) + E(T_{(f_j, j \ge 2)})]\, P\{K \ge 1\} \notag \\
&= E(\tau_{f_1}) + E(T_{(f_j, j \ge 2)})\, \beta_{f_1}. \tag{5}
\end{align}

It follows by induction that

\[ E\,T_{(f_j, j \ge 1)} = \sum_{i=1}^{k} E(\tau_{f_i}) \prod_{j=1}^{i-1} \beta_{f_j} + E(T_{(f_j, j \ge k+1)}) \prod_{j=1}^{k} \beta_{f_j}. \tag{6} \]

We  can  now  characterise  some  features  of  an  optimal  strategy. 

Stationarity and Nonrandomness We say that a strategy $(f_j, j \ge 1) \in \mathcal{S}$ is stationary if there
exists a (randomised) Boolean function f : {0, 1, ..., N} × {0, 1, ..., M − 1} → {0, 1} with

\[ f(n, m) = \begin{cases} 1 & \text{with probability } \phi(n, m), \\ 0 & \text{with probability } 1 - \phi(n, m), \end{cases} \]

and such that $f_j = f$ for each j ≥ 1. We then denote the strategy $(f_j, j \ge 1)$ by (f) and denote
the duration of the game under strategy $(f_j, j \ge 1)$ simply by $T_f$. Now consider any
stationary strategy $(f) \in \mathcal{S}$. A direct application of (5), which for a stationary strategy reads
$E\,T_f = E(\tau_f) + \beta_f\, E\,T_f$, now yields the following fundamental result:

\[ E\,T_f = \frac{E(\tau_f)}{\alpha_f}, \qquad (f) \in \mathcal{S}, \ \alpha_f \ne 0. \tag{7} \]
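
For a deterministic stationary strategy the two ingredients of (7) can be computed exactly by propagating probabilities through a single round. The sketch below is one way to do this under stated assumptions: the names `round_stats` and `expected_duration` are hypothetical, the restart function is passed as a callable f(n, m), and the mandatory restart at n = N is applied by the code rather than by f.

```python
def round_stats(N, M, f):
    """Return (alpha, e_tau) for one round under a deterministic restart function.

    alpha : probability that the round ends with M successes
    e_tau : expected number of tosses in the round
    f     : callable f(n, m) -> bool; a restart is forced at n == N regardless of f
    """
    alive = {0: 1.0}  # alive[m] = P(round still running with m successes)
    alpha = 0.0
    e_tau = 0.0
    for n in range(1, N + 1):
        nxt = {}
        for m, p in alive.items():
            for s in (m, m + 1):            # tail, then head, each with probability 1/2
                q = 0.5 * p
                if s == M:                  # round ends in success after n tosses
                    alpha += q
                    e_tau += n * q
                elif n == N or f(n, s):     # round ends in a restart after n tosses
                    e_tau += n * q
                else:                       # round continues
                    nxt[s] = nxt.get(s, 0.0) + q
        alive = nxt
    return alpha, e_tau


def expected_duration(N, M, f):
    """Expected duration E(tau_f) / alpha_f of the game, as in equation (7)."""
    alpha, e_tau = round_stats(N, M, f)
    return float("inf") if alpha == 0.0 else e_tau / alpha


if __name__ == "__main__":
    never = lambda n, m: False            # restart only when forced
    on_first_tail = lambda n, m: m == 0   # restart whenever no head has appeared
    print(expected_duration(2, 2, never), expected_duration(2, 2, on_first_tail))
```

For N = M = 2 this gives 8 tosses when restarting only at the end of a round and 6 tosses when restarting after an initial tail, which illustrates how much the restart rule can matter even in the smallest case.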

We say that a strategy $(f_j, j \ge 1) \in \mathcal{S}$ (governed by restart probabilities $\phi_j(n, m)$) is nonrandom
if $\phi_j(n, m) \in \{0, 1\}$ for every choice of 0 ≤ n ≤ N, 0 ≤ m ≤ M − 1, and j ≥ 1.

Lemma 2.1 There exists a stationary, nonrandom optimal strategy.


Remark: Equation (7) and Lemma 2.1 can also be demonstrated by setting up the coin
tossing problem as a pair of nested Markov decision problems and appealing to results from
Markov decision theory (cf. Ross [4], for instance). The proof given below is elementary.

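Proof: Begin by defining a linear order ≤ on $\mathcal{S}$ by $(f_j, j \ge 1) \le (g_j, j \ge 1)$ if $E\,T_{(f_j, j \ge 1)} \le
E\,T_{(g_j, j \ge 1)}$. Say that a strategy $(f_j, j \ge 1)$ is decreasing if

\[ (f_{i+j}, j \ge 1) \le (f_{i-1+j}, j \ge 1), \qquad i \ge 1. \]

The proof of the lemma now proceeds in several steps.

Claim 1: If $(f_j, j \ge 1) \le (f_{1+j}, j \ge 1)$ then $(f_1) \le (f_j, j \ge 1)$.

From (5) and the hypothesis of the claim, it follows that

\[ E\,T_{(f_j, j \ge 1)} \ge E(\tau_{f_1}) + E\,T_{(f_j, j \ge 1)}\, \beta_{f_1}. \]

We can assume $\alpha_{f_1} > 0$ as otherwise $E\,T_{f_1} = E\,T_{(f_j, j \ge 1)} = \infty$. Using (7) it follows that
$E\,T_{(f_j, j \ge 1)} \ge E(\tau_{f_1})/\alpha_{f_1} = E\,T_{f_1}$. The claim follows.

Claim 2: If $(f_j, j \ge 1)$ is not decreasing, then there exists k ≥ 1 such that $(f_k) \le (f_j, j \ge 1)$.

In fact, let k be the unique integer for which

\begin{align}
(f_{i+j}, j \ge 1) &\le (f_{i-1+j}, j \ge 1), \qquad 1 \le i \le k - 1, \notag \\
(f_{k-1+j}, j \ge 1) &< (f_{k+j}, j \ge 1). \tag{8}
\end{align}

(Clearly, k is the smallest integer for which (8) holds.) By transitivity of the linear order ≤
it follows that $(f_{k-1+j}, j \ge 1) \le (f_j, j \ge 1)$, whereas by (8) and Claim 1 we have $(f_k) \le
(f_{k-1+j}, j \ge 1)$. This proves the claim.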

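Claim 3: $\inf_{(f_j, j \ge 1) \in \mathcal{S}} E\,T_{(f_j, j \ge 1)} = \inf_{(f) \in \mathcal{S}} E\,T_f$.

By Claim 2, it suffices to show that $\inf_{(f) \in \mathcal{S}} E\,T_f \le E\,T_{(f_j, j \ge 1)}$ for any decreasing
strategy $(f_j, j \ge 1)$. Now, if $E\,T_{(f_j, j \ge 1)} = \infty$, then $(f) \le (f_j, j \ge 1)$ for any choice of f. So
now suppose that $E\,T_{(f_j, j \ge 1)} < \infty$. As $(f_j, j \ge 1)$ is decreasing it follows that the sequence
$\{E\,T_{(f_{k-1+j}, j \ge 1)}, k \ge 1\}$ decreases monotonically to a finite limit $T^*$ as $k \to \infty$. Note that
by (5) we have

\[ E\,T_{(f_{k-1+j}, j \ge 1)} = E[\tau_{f_k}] + E[T_{(f_{k+j}, j \ge 1)}]\, \beta_{f_k}. \]

We now assert that the probabilities $\beta_{f_k}$ are bounded away from 1 for large k. Indeed, suppose
the assertion does not hold. Then, for any $\delta > 0$, we can find arbitrarily large values of k for
which $\beta_{f_k} > 1 - \delta$. Noting that $E\,\tau_{f_k} \ge 1$ for all k, we have the inequality

\[ E\,T_{(f_{k-1+j}, j \ge 1)} \ge 1 + (1 - \delta)\, E[T_{(f_{k+j}, j \ge 1)}] \ge 1 + (1 - \delta)\, T^* = T^* + (1 - \delta T^*) \]

holding on an unbounded set of values for k for every choice of $\delta > 0$. On the other hand,
$E\,T_{(f_{k-1+j}, j \ge 1)} \downarrow T^*$ as $k \to \infty$, so that for any $\epsilon > 0$, for large enough values of k we have
$T^* \le E\,T_{(f_{k-1+j}, j \ge 1)} < T^* + \epsilon$. Contradiction.

Now fix $\epsilon > 0$ arbitrarily small and (for a suitable choice of $\delta > 0$) select $k(\epsilon)$ so large that for
all $k \ge k(\epsilon)$ the following relations hold simultaneously:

\[ \beta_{f_k} \le 1 - \delta, \qquad T^* \le E\,T_{(f_{k-1+j}, j \ge 1)} < T^* + \epsilon. \]

By (5) we hence obtain

\[ (T^* + \epsilon) - \beta_{f_k} T^* > E(\tau_{f_k}), \qquad k \ge k(\epsilon). \]

It follows from (7) that for $k \ge k(\epsilon)$,

\[ T^* + \frac{\epsilon}{\delta} \ge T^* + \frac{\epsilon}{\alpha_{f_k}} > \frac{E(\tau_{f_k})}{\alpha_{f_k}} = E\,T_{f_k} \ge \inf_{(f) \in \mathcal{S}} E\,T_f. \]

As this holds for every $\epsilon > 0$ it follows that

\[ T^* \ge \inf_{k} E\,T_{f_k} \ge \inf_{(f) \in \mathcal{S}} E\,T_f. \]

This concludes the proof of the claim.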

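Claim 4: There exists a stationary, nonrandom strategy $(f^*)$ such that $E\,T_{f^*} = \inf_{(f) \in \mathcal{S}} E\,T_f$.

Let $f^{(i)}$, $i = 1, \ldots, 2^{NM}$, enumerate all deterministic functions

\[ f^{(i)} : \{0, 1, \ldots, N\} \times \{0, 1, \ldots, M - 1\} \to \{0, 1\} \]

with $f^{(i)}(N, m) = 1$ for each m and i. Now consider a randomised stationary strategy $(f) \in \mathcal{S}$
(with a corresponding specification of probabilities $\phi(n, m)$). In any round f will then have a
sample realisation $f^{(i)}$ with probability $P\{f = f^{(i)}\} = \prod_{n,m} \phi^{(i)}(n, m)$ where³

\begin{align*}
\phi^{(i)}(n, m) &= f^{(i)}(n, m)\, \phi(n, m) + (1 - f^{(i)}(n, m))(1 - \phi(n, m)) \\
&= \begin{cases} \phi(n, m) & \text{if } f^{(i)}(n, m) = 1, \\ 1 - \phi(n, m) & \text{if } f^{(i)}(n, m) = 0. \end{cases}
\end{align*}

Starting with (7), some reflection now shows that

\begin{align*}
E\,T_f = \frac{E(\tau_f)}{\alpha_f}
&= \frac{\sum_i E(\tau_f \mid f = f^{(i)})\, P\{f = f^{(i)}\}}{\sum_i \alpha_{f^{(i)}}\, P\{f = f^{(i)}\}} \\
&= \frac{\sum_i E[\tau_{f^{(i)}}]\, P\{f = f^{(i)}\}}{\sum_i \alpha_{f^{(i)}}\, P\{f = f^{(i)}\}}
\ge \min_i \frac{E(\tau_{f^{(i)}})}{\alpha_{f^{(i)}}} = \min_i E\,T_{f^{(i)}}.
\end{align*}

As the number of deterministic functions is finite, there exists $f^* \in \{f^{(1)}, \ldots, f^{(2^{NM})}\}$ for
which $E\,T_{f^*} = \min_i E\,T_{f^{(i)}}$. Hence $(f^*) \le (f)$ for every $(f) \in \mathcal{S}$. This completes the proof of
the claim.

Claims 3 and 4 complete the proof of the lemma. ∎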


³This tacitly assumes that for any choices $a_{n,m} \in \{0, 1\}$, the events $\{f(n, m) = a_{n,m}\}$ are independent over all
choices of (n, m). As can be readily seen, the proof works without change even if the independence assumption is
relaxed, i.e., the joint distribution of the random variables f(n, m) is not a product distribution.


Hu,  Venkatesh 


Consistency We say that a stationary, nonrandom strategy (f) is consistent (or simply, f is
consistent) if f(n, m) is increasing in n and decreasing in m, i.e.,

\[ f(n, m) = 1 \implies f(k, l) = 1 \qquad (k \ge n, \ l \le m). \]

Lemma 2.2 There exists a consistent optimal strategy.

This is in keeping with intuition; if, for instance, f(n, m) = 1 but f(k, m) = 0 for some k > n
then we would expect the strategy to be suboptimal as increasing the number of tosses in
a round while keeping the number of successes fixed at m would not seem to improve the
situation. We defer a proof of this assertion till Section 4 where we provide a simple direct
proof by considering a related optimal stopping problem.

If f is consistent, we define the restart boundary F(n) of f by F(n) = max{m : f(n, m) = 1}
if there is any m for which f(n, m) = 1; else we set F(n) = −1. We will also occasionally refer
to F(n) as the restart boundary of the consistent strategy (f). Consistent strategies can be
conveniently represented by means of the restart boundary in the (n, m) plane. Some examples
are illustrated in Fig. 1. Note that the boundary of consistent strategies is monotone.


Figure 1: (a) A consistent, parsimonious strategy; (b) a consistent strategy; (c) an inconsistent
strategy. The parameter n represents the number of coin tosses in a round, and m denotes the
number of heads obtained. The dotted lines show some possible sample paths. The dashed lines
indicate the termination boundary (M heads achieved), and the solid lines enclosing the shaded
areas denote the restart boundaries. The shaded areas correspond to the restart region.

Parsimony We say that a consistent strategy (f) is parsimonious⁴ (or simply, f is parsimonious)
if there are M distinct points of increase on the boundary, i.e., F(n) ≤ F(n + 1) ≤ F(n) + 1,
0 ≤ n < N. In particular, define the sequence of points $n_m$ by

\[ n_m = \min\{n : f(n, m) = 1\}, \qquad m = 0, 1, \ldots, M - 1. \]

The points $n_m$ define the points of increase of the boundary F(n). For a parsimonious strategy
(f) then, we require that $n_{m+1} > n_m$, 0 ≤ m ≤ M − 2; in particular, $F(n_m) = m$, 0 ≤ m ≤
M − 1.
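
The definitions of consistency, restart boundary and parsimony translate into short checks on a deterministic restart function. The following sketch is illustrative only: the function names are hypothetical, f is passed as a callable f(n, m), and `is_parsimonious` tests just the step condition F(n) ≤ F(n + 1) ≤ F(n) + 1 on a boundary that has already been computed.

```python
def restart_boundary(f, N, M):
    """Return [F(0), ..., F(N)] with F(n) = max{m : f(n, m) = 1}, or -1 if there is no such m."""
    boundary = []
    for n in range(N + 1):
        restart_levels = [m for m in range(M) if f(n, m)]
        boundary.append(max(restart_levels) if restart_levels else -1)
    return boundary


def is_consistent(f, N, M):
    """Check that f(n, m) = 1 implies f(k, l) = 1 for all k >= n and l <= m."""
    for n in range(N + 1):
        for m in range(M):
            if f(n, m):
                for k in range(n, N + 1):
                    for l in range(m + 1):
                        if not f(k, l):
                            return False
    return True


def is_parsimonious(boundary):
    """Check the step condition F(n) <= F(n+1) <= F(n) + 1 for 0 <= n < N."""
    return all(boundary[n] <= boundary[n + 1] <= boundary[n] + 1
               for n in range(len(boundary) - 1))


if __name__ == "__main__":
    N, M = 4, 2
    f = lambda n, m: n >= 4 or m <= n - 2   # example strategy, chosen for illustration
    F = restart_boundary(f, N, M)
    print(F, is_consistent(f, N, M), is_parsimonious(F))
```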

⁴The terminology relates to the size of permitted jumps.
Lemma 2.3 There exists a parsimonious optimal strategy, i.e., a consistent optimal strategy
$(f^*)$ with boundary $F^*$ satisfying $F^*(n) \le F^*(n + 1) \le F^*(n) + 1$, 0 ≤ n ≤ N − 1.

Proof: Consider a consistent optimal strategy with boundary $F^*$. Let $S_n$ (n ≥ 1) denote
the number of successes in n tosses in any round following a restart. Assume that $S_i > F^*(i)$,
i < n and $S_n = F^*(n) + 1$. Thus, the game proceeds without restarting after the nth toss in
the round. If $F^*(n + 1) \ge F^*(n) + 2$ then a restart is forced regardless of the result of the
(n + 1)th toss. But this implies that replacing the boundary point $F^*(n)$ by $F^*(n) + 1$ yields
a superior strategy, thus contradicting the supposed optimality of $F^*$. ∎


Restart on Failure What if in any restart the first toss results in a tail? The following assertion
maintains that an optimal strategy would restart if a round begins with a failure.

Lemma 2.4 There exists a parsimonious optimal strategy whose boundary $F^*$ has its first
point of increase at $n_0 = 1$: $F^*(0) = -1$, $F^*(n_0) = F^*(1) = 0$.

Proof: If a round begins with a failure there are two options: continue the round or restart.
Consider sample paths in the (n, m) plane where n denotes the number of tosses in a round
and m denotes the number of successes. Restarting implies starting a sample path anew from
(0, 0), while continuing the round implies initiating sample paths from (1, 0). Now consider
the boundary F of any consistent strategy and assume F(0) = F(1) = −1. Given that a
round begins with a failure, it is clear that the number of sample paths lying strictly above the
boundary F is larger if the game is restarted so that replacing the boundary point F(1) = −1
by F(1) = 0 yields a superior (or at any rate, not inferior) strategy. ∎


We  encapsulate  our  findings  in  the  following  statement. 

Theorem 2.5 Let $\mathcal{V} \subset \mathcal{S}$ denote the family of parsimonious strategies whose boundaries have their
first point of increase at $n_0 = 1$. Then

\[ D(N, M) = \inf_{(f_j, j \ge 1) \in \mathcal{S}} E\,T_{(f_j, j \ge 1)} = \inf_{(f) \in \mathcal{V}} E\,T_f. \]

Remarks: Note that we are ensured that the probability of obtaining M heads in any round of
tosses is non-zero for any parsimonious strategy which has a first point of increase at $n_0 \ge 1$. These
two conditions hence eliminate strategies for which success is impossible [see the consistent strategy
shown in Fig. 1(b)]. Note also that the restriction to stationary, nonrandom strategies reduces the
search problem from an infinity of possible stationary, random strategies to a set of $2^{NM}$ stationary,
nonrandom strategies. Restricting attention to consistent strategies further reduces the search space,
and the constraints of parsimony with a first point of increase $n_0 = 1$ result in the reduced search
space $\mathcal{V}$ of $\binom{N-1}{M-1}$ strategies. We outline a method for efficiently searching the space $\mathcal{V}$ for an
optimal strategy in Section 4.
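
For very small N and M the reduced family can simply be enumerated. The sketch below is a brute-force illustration and is not the backward induction algorithm of Section 4: it assumes each strategy is described by its points of increase 1 = n_0 < n_1 < ... < n_{M−1} ≤ N, evaluates the expected duration as E(τ)/α for each choice, and keeps the best one. The function names are hypothetical.

```python
from itertools import combinations


def duration_for_restart_points(N, M, points):
    """Expected duration when a round restarts after n tosses with m < M successes iff n >= points[m]."""
    alive = {0: 1.0}  # alive[m] = P(round still running with m successes)
    alpha = 0.0       # probability that a round ends with M successes
    e_tau = 0.0       # expected number of tosses in a round
    for n in range(1, N + 1):
        nxt = {}
        for m, p in alive.items():
            for s in (m, m + 1):             # tail, then head
                q = 0.5 * p
                if s == M:
                    alpha += q
                    e_tau += n * q
                elif n >= points[s]:         # inside the restart region
                    e_tau += n * q
                else:
                    nxt[s] = nxt.get(s, 0.0) + q
        alive = nxt
    return e_tau / alpha if alpha > 0.0 else float("inf")


def best_parsimonious(N, M):
    """Enumerate restart points (1, n_1, ..., n_{M-1}) and return (best duration, best points)."""
    best_value, best_points = float("inf"), None
    for tail in combinations(range(2, N + 1), M - 1):
        points = (1,) + tail
        value = duration_for_restart_points(N, M, points)
        if value < best_value:
            best_value, best_points = value, points
    return best_value, best_points


if __name__ == "__main__":
    print(best_parsimonious(3, 2))
```

The enumeration visits one strategy per choice of M − 1 points from {2, ..., N}, so it is practical only for small instances.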

3  EXPECTED  DURATION 

We consider now the problem of estimating the minimum expected duration of the game D(N, M).
We begin by showing a strict lower bound for D(N, M), valid for every fixed M. Recall that we can
restrict ourselves to considering the subfamily of stationary strategies.

Theorem 3.1 For every M, {D(N, M)} is a monotone sequence in N decreasing to the limit
$D(\infty, M) = 2M$ as $N \to \infty$.
Proof: Fix M. Let S(N) denote the family of stationary strategies when the number of available
coins is N. If N < N' it is clear that S(N) is contained in S(N') in the following sense: for
any strategy (f) in S(N) there exist strategies (f') in S(N') such that the restriction of f' to
{0, ..., N} × {0, ..., M - 1} is equal to f; in addition, it is easy to see that these strategies will result
in the same expected duration as (f) because of the restriction f(N, m) = 1. Hence D(N', M) ≤
D(N, M).

Now assume that there is an infinite store of coins. It is clear that an optimal strategy is one
that never restarts the game. Let D(∞, M) denote the expected duration of the game under this
optimal strategy. Let $a_M(k)$ denote the probability that the Mth success occurs at the kth trial.
As a first passage to M involves a prior first passage through M - 1 we have the convolutional
relationship

\[ a_M(k) = \sum_{j=1}^{k-1} a_{M-1}(j)\, a_1(k - j). \]

Consider the generating function $A_M(s) = \sum_{k} a_M(k) s^k$, so that $A_M(s) = [A_1(s)]^M$. As
$a_1(0) = 0$ and $a_1(k) = 2^{-k}$ for $k \ge 1$, we have $A_1(s) = s/(2 - s)$. It follows that
$D(\infty, M) = A_M'(1) = 2M$. ∎
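As a numerical sanity check on the limit 2M, the following sketch evaluates the first-passage probabilities exactly. It assumes the standard negative binomial form $a_M(k) = \binom{k-1}{M-1} 2^{-k}$ for a fair coin (this closed form is not stated in the text), and the function names are illustrative only.

```python
from fractions import Fraction
from math import comb

def a(M, k):
    """P{the M-th head of a fair coin occurs at toss k}, for M >= 1."""
    if k < M:
        return Fraction(0)
    return Fraction(comb(k - 1, M - 1), 2 ** k)

def convolution_rhs(M, k):
    """Right-hand side of the convolution: sum_{j=1}^{k-1} a_{M-1}(j) a_1(k-j), M >= 2."""
    return sum(a(M - 1, j) * a(1, k - j) for j in range(1, k))

def truncated_mean(M, kmax):
    """Partial sum of k * a_M(k); approaches A_M'(1) as kmax grows."""
    return sum(k * a(M, k) for k in range(M, kmax + 1))
```

For example, `a(3, 7)` agrees with `convolution_rhs(3, 7)`, and `float(truncated_mean(4, 400))` is numerically indistinguishable from 2M = 8.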

Consider any consistent strategy (f). As before, define the sequence of points $n_m$, $0 \le m \le M - 1$, by

\[ n_m = \min\{n : f(n, m) = 1\}. \]

(If (f) is parsimonious, these are just the M points of increase of the boundary.) Now define the sets
of pairs

\[ Q = \{(n, M) : 0 \le n \le N\}, \qquad R = \{(n_m, m) : 0 \le m \le M - 1\}. \]

The set Q can be identified as the success termination state (i.e., M heads are achieved in the round),
and the set R as the restart state (i.e., the restart boundary is encountered during the round). Let
$S_n$ denote the number of heads obtained in a sequence of $n \le N$ tosses. We momentarily suppress
the dependence on f and write

\[ \tau = \min\{n : (n, S_n) \in Q \cup R\} \]

for the number of tosses in a round before either the game ends with M successes, or the round of
tosses is terminated and the game restarted. Also let

\[ p_m = P\{\tau = n_m,\ S_\tau = m\} \tag{9} \]

denote the probability that, in any round of tosses, a restart occurs following the $n_m$th toss. (On
the (n, m) plane then, this corresponds to a sample path which lies above the boundary for $n < n_m$
and intersects the boundary at the point $(n_m, m)$.) It follows that $\beta = \sum_{m=0}^{M-1} p_m$ is the probability
that a restart occurs during any round of tosses, and $\alpha = 1 - \sum_{m=0}^{M-1} p_m$ is the probability that the
game terminates with M successes in a given round of tosses.

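The following recursive form for the probabilities $p_m$ admits of efficient evaluation.

Lemma 3.2 The probabilities $p_m$, $0 \le m \le M - 1$, satisfy the following recursion:

\[ \text{BASE:} \quad p_0 = 2^{-n_0}, \]

\[ \text{RECURSION:} \quad p_m = \binom{n_m}{m} 2^{-n_m} - \sum_{i=0}^{m-1} \binom{n_m - n_i}{m - i} 2^{-(n_m - n_i)}\, p_i, \qquad m \ge 1. \tag{10} \]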
Proof: It is clear that $p_0 = P\{S_{n_0} = 0\} = 2^{-n_0}$. Now consider $1 \le m \le M - 1$. Rewriting $p_m$ in
an alternative form, we have

\[ p_m = P\{S_{n_m} = m,\ \tau > n_{m-1}\} = P\{S_{n_m} = m\} - P\{S_{n_m} = m,\ \tau \le n_{m-1}\}
       = \binom{n_m}{m} 2^{-n_m} - \sum_{i=0}^{m-1} P\{S_{n_m} = m,\ \tau = n_i\}. \tag{11} \]

For i < m, the event $\{S_{n_m} = m,\ \tau = n_i\}$ is contained in the set of sample points $\{S_{n_i} = i\}$; it hence
follows that

\[ P\{S_{n_m} = m,\ \tau = n_i\} = P\{S_{n_m} = m \mid (\tau, S_\tau) = (n_i, i)\}\, P\{(\tau, S_\tau) = (n_i, i)\}
   = P\{S_{n_m} - S_{n_i} = m - i\}\, p_i, \]

the last but one equation following from the independence of the trials. Substituting in (11) completes
the proof. ∎
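The recursion (10) translates directly into a few lines of code. The sketch below is illustrative only: it assumes the points of increase are supplied as a nondecreasing list `n` with `n[0] >= 1` and `n[-1] <= N`, uses exact rational arithmetic, and includes a brute-force enumeration over all toss sequences (feasible only for small N) as a cross-check. The function names are not from the paper.

```python
from fractions import Fraction
from itertools import product
from math import comb

def restart_probs(n):
    """p_m from recursion (10); n[m] is the point of increase n_m."""
    p = []
    for m, nm in enumerate(n):
        total = Fraction(comb(nm, m), 2 ** nm)            # P{S_{n_m} = m}
        for i in range(m):
            gap = nm - n[i]
            total -= Fraction(comb(gap, m - i), 2 ** gap) * p[i]
        p.append(total)
    return p

def restart_probs_brute(n, N):
    """Same probabilities by enumerating all 2^N equally likely toss sequences."""
    M = len(n)
    p = [Fraction(0)] * M
    for tosses in product((0, 1), repeat=N):
        heads = 0
        for t, x in enumerate(tosses, start=1):
            heads += x
            if heads == M:            # success: the round ends in Q
                break
            if t == n[heads]:         # boundary hit: the round ends in R
                p[heads] += Fraction(1, 2 ** N)
                break
    return p
```

With `n = [1, 3, 4]` and `N = 4`, both routines return the probabilities 1/2, 1/8, 1/8.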


The probabilities $p_m$ (corresponding to any given consistent strategy (f)) turn out to be
extraordinarily useful as they allow us to write down an explicit expression for the expected duration
of the game under strategy (f).


Theorem 3.3 For any consistent strategy (f),

\[ E\,T_f = 2M + \frac{2 \sum_{m=0}^{M-1} m\, p_m}{1 - \sum_{m=0}^{M-1} p_m}. \tag{12} \]

Remarks: Note that to compute $E\,T_f$ it suffices to specify the points of increase $n_m$, $0 \le m \le M - 1$.
Note also that $E\,T_f \ge 2M$ in accordance with Theorem 3.1.


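Proof: An appeal to (7) yields $E\,T_f = E(\tau)/\alpha$. Now $S_\tau = \sum_{i=1}^{\tau} X_i$ so that Wald's equation
gives us $E\,S_\tau = (E\,\tau)(E\,X_1) = E(\tau)/2$. We hence have

\[ E\,T_f = \frac{2}{\alpha} E\,S_\tau = \frac{2}{\alpha}\Big( E[S_\tau; (\tau, S_\tau) \in Q] + E[S_\tau; (\tau, S_\tau) \in R] \Big)
          = \frac{2}{\alpha}\Big( M\alpha + E[S_\tau; (\tau, S_\tau) \in R] \Big) = 2M + \frac{2}{\alpha} E[S_\tau; (\tau, S_\tau) \in R]. \]

With $p_m$ defined as in (9), we now have

\[ E[S_\tau; (\tau, S_\tau) \in R] = \sum_{m=0}^{M-1} m\, P\{(\tau, S_\tau) = (n_m, m)\} = \sum_{m=0}^{M-1} m\, p_m. \]

Recalling that $\alpha = 1 - \sum_{m=0}^{M-1} p_m$ completes the proof. ∎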
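To illustrate how (10) and (12) combine, the sketch below evaluates the expected duration from the points of increase and compares it with an independent computation of E(τ)/α obtained by propagating the distribution of a single round. It assumes a nondecreasing list `n` of points of increase with `n[0] >= 1` and `n[-1] <= N`; the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def restart_probs(n):
    """p_m from recursion (10)."""
    p = []
    for m, nm in enumerate(n):
        total = Fraction(comb(nm, m), 2 ** nm)
        for i in range(m):
            gap = nm - n[i]
            total -= Fraction(comb(gap, m - i), 2 ** gap) * p[i]
        p.append(total)
    return p

def expected_duration(n):
    """E T_f from formula (12), with M = len(n)."""
    p = restart_probs(n)
    M = len(n)
    return 2 * M + 2 * sum(m * pm for m, pm in enumerate(p)) / (1 - sum(p))

def expected_duration_direct(n, N):
    """E(tau)/alpha computed by tracking one round toss by toss."""
    M = len(n)
    alive = {0: Fraction(1)}          # heads so far -> probability, round still running
    e_tau = Fraction(0)
    alpha = Fraction(0)
    for t in range(1, N + 1):
        nxt = {}
        for h, pr in alive.items():
            for h2 in (h, h + 1):     # tail or head, each with probability 1/2
                q = pr / 2
                if h2 == M:           # M-th head: success
                    alpha += q
                    e_tau += t * q
                elif t >= n[h2]:      # on or below the boundary: restart
                    e_tau += t * q
                else:
                    nxt[h2] = nxt.get(h2, Fraction(0)) + q
        alive = nxt
    return e_tau / alpha
```

For the points of increase `[1, 3, 4]` with `N = 4` both routes give an expected duration of 9, and for the Indolent points `[3, 3]` with `N = 3` both give 11/2.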


Remark: The model we are concerned with in this paper does not impose a “cost” for restarting.
A more general situation where there is a cost $c_k$ associated with k restarts is, however, easily
incorporated into the analysis. For instance, if K denotes the number of restarts, let us define the
cost $C_f$ of the consistent strategy (f) by

\[ \forall k \ge 0: \quad C_f = c_k\, E[T_f \mid K = k] \quad \text{with probability } \alpha\beta^k. \]

Note that K has the geometric distribution (see (4)) so that $\alpha\beta^k$ is just the probability of the event
{K = k}. The expected cost of strategy (f) is hence

\[ E\,C_f = \sum_{k=0}^{\infty} c_k\, E[T_f; K = k] = \alpha \sum_{k=0}^{\infty} c_k \beta^k\, E[T_f \mid K = k]. \]

Let us formally define

\[ C_1 = \alpha \sum_{k=0}^{\infty} c_k \beta^k, \qquad C_2 = \alpha \sum_{k=0}^{\infty} k\, c_k \beta^k. \]

Following  the  previous  analysis  we  can  now  readily  show 

ECf  =2CiA/+ 

Z^m=0 

For the rest of the paper we adopt the constant cost model, $c_k = 1$ ($k \ge 0$). It is easy to verify that for
this model $C_1 = 1$ and $C_2 = \beta/\alpha$ so that $E\,C_f = E\,T_f$ agrees with Theorem 3.3. As instances of other
cost models, a linear cost model $c_k = \gamma k$ results in $C_1 = \gamma\beta/\alpha$ and $C_2 = \gamma\beta(1 + \beta)/\alpha^2$, and an
exponential cost model $c_k = \lambda\theta^k$ (with $\theta < 1/\beta$) results in $C_1 = \lambda\alpha/(1 - \theta\beta)$ and $C_2 = \lambda\theta\alpha\beta/(1 - \theta\beta)^2$.


The  following  examples  illustrate  the  utility  of  Lemma  3.2  and  Theorem  3.3  with  explicit  calculations 
of  the  expected  duration  of  the  game  for  diverse  strategies. 

Example:  Indolent  Strategy 

The strategy of indolence ($f_I$) prescribes that we wait till all N coins are tossed before restarting
the game. In particular, the strategy is specified by

\[ f_I(n, m) = \begin{cases} 0 & \text{if } 0 \le n \le N - 1, \\ 1 & \text{if } n = N. \end{cases} \]

The strategy is clearly consistent (but not parsimonious) and we have $n_m = N$, $0 \le m \le M - 1$ (see
Fig. 2). Direct calculation then yields

\[ p_m = P\{S_N = m\} = \binom{N}{m} 2^{-N}, \qquad m = 0, 1, \ldots, M - 1. \tag{13} \]

Equation (12) now readily yields

\[ E\,T_I = 2M + \frac{2 \sum_{m=0}^{M-1} m \binom{N}{m}}{2^N - \sum_{m=0}^{M-1} \binom{N}{m}} \]

for the expected duration $E\,T_I$ of the game. ∎

Example:  Hope  Springs  Eternal 

This strategy restarts the game only when the number of tails in a round of tosses reaches
N - M + 1, i.e., it is futile to proceed with the round any further. The strategy is parsimonious and
its boundary $\Gamma_H$ is given by

\[ \Gamma_H(n) = \begin{cases} -1 & \text{if } 0 \le n \le N - M, \\ n - N + M - 1 & \text{if } N - M + 1 \le n \le N. \end{cases} \]

The points of increase of the boundary are $n_m = m + N - M + 1$, $0 \le m \le M - 1$ (see Fig. 3). We
will find it convenient to use the representation $n_m = n_0 + m$ with $n_0 = N - M + 1$.

Figure 3: The boundary for Hope Springs Eternal.

We now claim that

\[ p_m = \binom{n_0 + m - 1}{m} 2^{-(n_0 + m)}, \qquad m = 0, \ldots, M - 1. \tag{14} \]

We prove the result by induction on m. It is clear that $p_0 = P\{S_{n_0} = 0\} = 2^{-n_0}$, so that (14) holds
for m = 0. Let us now assume that (14) holds for $i \le m - 1$. As $n_m - n_i = m - i$, it follows from (10)
and the inductive hypothesis that

\[ p_m = \binom{n_0 + m}{m} 2^{-(n_0 + m)} - \sum_{i=0}^{m-1} 2^{-(m - i)}\, p_i
       = 2^{-(n_0 + m)} \left[ \binom{n_0 + m}{m} - \sum_{i=0}^{m-1} \binom{n_0 + i - 1}{i} \right]
       = 2^{-(n_0 + m)} \binom{n_0 + m - 1}{m}, \]

the last step following by repeated application of the binomial identity

\[ \binom{n}{k - 1} + \binom{n}{k} = \binom{n + 1}{k}. \]

This concludes the proof of the claim (14). Equation (12) now yields

\[ E\,T_H = 2M + \frac{2 \sum_{m=0}^{M-1} m \binom{n_0 + m - 1}{m} 2^{-(n_0 + m)}}{1 - \sum_{m=0}^{M-1} \binom{n_0 + m - 1}{m} 2^{-(n_0 + m)}} \]

for the expected duration $E\,T_H$ of the game. ∎
Example: Step in Time

This strategy decrees a restart when the number of tails leads the number of heads by a fixed
amount, say L. This is again a parsimonious strategy with boundary $\Gamma_S$ given by

\[ \Gamma_S(n) = \begin{cases} -1 & \text{if } 0 \le n \le L - 1, \\ n - L & \text{if } L \le n \le L + M - 1, \\ M - 1 & \text{if } L + M \le n \le N. \end{cases} \]

Hope Springs Eternal is hence a specific case of this strategy with the choice L = N - M + 1. The
points of increase of the boundary are $n_m = m + L$, $0 \le m \le M - 1$ (see Fig. 4). It is readily seen
that this is equivalent to Hope Springs Eternal with a reduced store of N' = L + M - 1 coins.

Figure 4: The boundary for Step in Time.

The probabilities $p_m$ for Step in Time can hence be directly inferred from (14):

\[ p_m = \binom{L + m - 1}{m} 2^{-(L + m)}, \qquad m = 0, \ldots, M - 1. \tag{15} \]

An appeal to (12) again yields

\[ E\,T_S = 2M + \frac{2 \sum_{m=0}^{M-1} m\, 2^{-(L + m)} \binom{L + m - 1}{m}}{1 - \sum_{m=0}^{M-1} 2^{-(L + m)} \binom{L + m - 1}{m}} \]

for the expected duration $E\,T_S$ of the game. ∎

Example:  Ballot  Strategy 

Consider the parsimonious strategy indicated schematically in Fig. 5. The points of increase
of the boundary of the strategy are specified by $n_m = 1 + \lceil mK \rceil$, $0 \le m \le M - 1$, where
K = (N - 1)/(M - 1). (Note that $n_0 = 1$ and $n_{M-1} = N$.) It is easy to verify that the following
inequalities hold:

\[ \frac{M - m}{N - n_m} > \frac{M}{N}, \qquad m = 0, \ldots, M - 1. \]

Thus, we can interpret the Ballot Strategy as follows: the strategy decrees that a round be restarted
iff the ratio of the number of additional heads needed to the number of remaining coins is larger
than it was at the start of the game.*

For general values of K, a closed form for the probabilities $p_m$ is hard to secure, and the general
recursion (10) must be appealed to. When K is an integer, however, the following explicit form can
be obtained:

\[ p_m = \frac{1}{mK + 1} \binom{mK + 1}{m} 2^{-(mK + 1)}, \qquad m = 0, \ldots, M - 1. \tag{16} \]

* As in the classical ballot theorem (cf. Feller [5], for instance), for a round to continue, we require the initial ratio
M/N of heads to coins to lead throughout the round.
The proof is again by induction on m. As $n_0 = 1$, it is clear that $p_0 = P\{S_1 = 0\} = 1/2$, so that (16)
holds for m = 0. Now assume that (16) holds for $i \le m - 1$. Note the equivalent representation

\[ p_i = \frac{1}{i} \binom{iK}{i - 1} 2^{-(iK + 1)}, \qquad i = 1, \ldots, m - 1. \]

As $n_m - n_i = (m - i)K$, it follows from (10) and the inductive hypothesis that

\[ p_m = \binom{mK + 1}{m} 2^{-(mK + 1)} - \sum_{i=0}^{m-1} \binom{(m - i)K}{m - i} 2^{-(m - i)K}\, p_i
       = 2^{-(mK + 1)} \left[ \binom{mK}{m - 1} - \sum_{i=1}^{m-1} \frac{1}{i} \binom{iK}{i - 1} \binom{(m - i)K}{m - i} \right]. \]


An  appeal  to  the  general  combinatorial  identity  [6] 


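completes the proof of (16) for integer K. Equation (12) now yields

\[ E\,T_B = 2M + \frac{4 \sum_{m=1}^{M-1} \binom{mK}{m - 1} 2^{-(mK + 1)}}{1 - 2 \sum_{m=1}^{M-1} \frac{1}{m} \binom{mK}{m - 1} 2^{-(mK + 1)}} \]

for the expected duration $E\,T_B$ of the game when K is an integer.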
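The closed form (16) can be compared against the general recursion (10) for small integer K. The sketch below assumes integer K, so that the points of increase are $n_m = 1 + mK$, and reads (16) as $p_m = \frac{1}{mK+1}\binom{mK+1}{m} 2^{-(mK+1)}$; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def restart_probs(n):
    """p_m from the general recursion (10); n[m] is the point of increase n_m."""
    p = []
    for m, nm in enumerate(n):
        total = Fraction(comb(nm, m), 2 ** nm)
        for i in range(m):
            gap = nm - n[i]
            total -= Fraction(comb(gap, m - i), 2 ** gap) * p[i]
        p.append(total)
    return p

def ballot_closed_form(m, K):
    """Closed form (16) for integer K."""
    return Fraction(comb(m * K + 1, m), (m * K + 1) * 2 ** (m * K + 1))

def ballot_points(M, K):
    """Points of increase n_m = 1 + m K of the Ballot Strategy for integer K."""
    return [1 + m * K for m in range(M)]
```

For instance, with K = 2 and M = 4 both routes give 1/2, 1/8, 1/16, 5/128.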

The general system of probabilities $\{p_m\}$ (for arbitrary K) corresponding to the Ballot Strategy
can also be determined from the generating function, for which an explicit form can be shown. Define

\[ r_n = P\{\tau = n,\ S_\tau = (n - 1)/K\}, \qquad n = 1, 2, \ldots \]

Note that $r_{n_m} = p_m$ for $0 \le m \le M - 1$. (In general, $r_n = 0$ unless n = 1 + mK for some nonnegative integer
m.) Thus, $\beta = \sum_{m=0}^{M-1} p_m = \sum_{n=1}^{N} r_n$ is the probability of a restart in a round, and $\alpha = 1 - \beta$ is the
probability of attaining M successes in a given round. Let

\[ \hat{\tau} = \inf\{n : S_n - (n - 1)/K < 0\}. \]

Clearly, $\hat{\tau}$ has distribution $\{r_n\}$. Consider the generating function $G_{\hat{\tau}}(s) = \sum_{n=1}^{\infty} r_n s^n$. We can now
directly apply a result from Feller [5, page 413] to obtain the expression


Gf(s) 


1  —  exp 


-f;-P{5„  >(n-  D/A} 

n=l  J 


-i: 


(s/2)" 


i: 


1  — 


n 
The distribution $\{r_n\}$ is thus completely determined, and hence so are the probabilities $\{p_m\}$. ∎

While explicit estimates of the duration of the game under any strategy are thus readily
generated from the recursion (10), a careful analysis of the (asymptotic) behaviour of the expected
duration of the game under a given strategy requires substantial algebraic effort, even for the simple
strategies described here. Nonetheless, sharp results (Theorem 3.6; see also the remarks following
the theorem) can be inferred about the minimum expected duration D(N, M) of the game by a
consideration of the simplest of strategies, the Indolent Strategy.

We begin with a preliminary result due to W. Hoeffding [7].

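Lemma 3.4 Let N be any positive integer, and $\eta \in (0, 1)$ and $c \in (0, 1)$ any fixed parameters. Then:

\[ \max\left\{ \sum_{m \le (1 - c)\eta N} \binom{N}{m} \eta^m (1 - \eta)^{N - m},\ \sum_{m \ge (1 + c)\eta N} \binom{N}{m} \eta^m (1 - \eta)^{N - m} \right\} \le e^{-2c^2\eta^2 N}. \]

This exponential bound for the tail of the binomial is not the sharpest possible, but suffices for our
purposes.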

Theorem 3.5 Let $T_I$ denote the duration of the game under the Indolent Strategy. Then

\[ \frac{1}{N} E\,T_I \le D(N, M) \le E\,T_I. \]

Proof: The upper bound for D(N, M) is obvious. Now recall that for any consistent strategy
(f), we have from (7) that $E\,T_f(N, M) = E(\tau_f)/\alpha_f$. Let $\tau_I$ and $\alpha_I$ denote the corresponding values
of $\tau_f$ and $\alpha_f$ for the Indolent Strategy. It is clear that $\alpha_I = \sup_f \alpha_f$. Also, for any parsimonious
strategy $(f) \in V$, we have $1 \le E\,\tau_f \le N$. Thus, $D(N, M) \ge \inf_{(f) \in V} E\,\tau_f / \sup_{(f) \in V} \alpha_f \ge 1/\alpha_I \ge
E\,\tau_I/(N\alpha_I) = E\,T_I/N$. ∎

We  can  now  directly  apply  Theorem  3.5  to  obtain  the  following  result  which  shows  two  distinct 
domains  of  behaviour  for  the  minimum  expected  duration  of  the  game. 

Theorem 3.6 Let c > 0 be any fixed parameter and N any positive integer. Then:

(a) If $M \le \frac{1}{2}(1 - c)N$, then $2M \le D(N, M) \le 2M \left[ 1 + \frac{e^{-c^2 N/2}}{1 - e^{-c^2 N/2}} \right]$;

(b) If M = N/2, then $N \le D(N, N/2) \le 2N$;

(c) If $M \ge \frac{1}{2}(1 + c)N$, then $D(N, M) \ge \frac{1}{N} e^{c^2 N/2}$.

Remarks: Slightly tighter (if messier) exponents can be obtained using Chernoff's bound for the
tail of the binomial instead of the Hoeffding bounds of Lemma 3.4. In particular, we can replace
the exponents $c^2/2$ by $\ln 2 - H[(1 - c)/2]$ throughout in the above bounds for D(N, M). Here ln
denotes a logarithm to base e and $H(x) = -x \ln x - (1 - x) \ln(1 - x)$ is the binary entropy function
in nats. As Chernoff's bound is known to be exponentially tight, this in particular implies the
following stronger asymptotic result: If $M = \frac{1}{2}(1 + c)N + o(N)$ for any positive constant c, then
$\ln D(N, M) \sim [\ln 2 - H((1 - c)/2)]\, N$ as $N \to \infty$.

Note the abrupt, threshold change in behaviour of the optimal strategy around M = N/2
where the minimum expected duration of the game goes from linear to exponential in M. The
moral of the story is that for M < N/2, essentially any strategy (including the Indolent Strategy!)
yields performance comparable to the optimal strategy: all strategies in this regime have expected
durations $2M + O(e^{-c_1 N})$ for a positive constant $c_1$. For M > N/2 on the other hand, all strategies
have expected duration $\Omega(e^{c_2 N})$ for a positive constant $c_2$.


Proof: Consider the Indolent Strategy. When $M = \frac{1}{2}(1 - c)N$ for some fixed choice of c > 0,
Lemma 3.4 and (13) yield the bound*

\[ \sum_{m=0}^{M-1} m\, p_m \le M \sum_{m=0}^{M-1} p_m = M\, 2^{-N} \sum_{m=0}^{M-1} \binom{N}{m} \le M e^{-c^2 N/2}. \]

A similar application of Lemma 3.4 yields

\[ \alpha = 1 - \sum_{m=0}^{M-1} p_m = 1 - 2^{-N} \sum_{m=0}^{M-1} \binom{N}{m} \ge 1 - e^{-c^2 N/2}. \]

From (12), we then have

\[ E\,T_I \le 2M + \frac{2M e^{-c^2 N/2}}{1 - e^{-c^2 N/2}}, \qquad M \le \tfrac{1}{2}(1 - c)N, \tag{17} \]

the result holding for $M \le \frac{1}{2}(1 - c)N$ because $E\,T_I$ decreases monotonically as M decreases for
every fixed N.

Now consider the case M = N/2. From (13) we again have

\[ \sum_{m=0}^{M-1} p_m = 2^{-N} \sum_{m \le N/2 - 1} \binom{N}{m} \le \frac{1}{2} \]

and

\[ \sum_{m=0}^{M-1} m\, p_m \le M\, 2^{-N} \sum_{m \le N/2 - 1} \binom{N}{m} \le \frac{M}{2} = \frac{N}{4}. \]

Substituting in (12) we have

\[ E\,T_I \le N + \frac{2 (N/4)}{1/2} = 2N, \qquad M = N/2. \tag{18} \]

Finally, when $M = \frac{1}{2}(1 + c)N$ for any fixed choice of c > 0, Lemma 3.4 and (13) again yield the
bound

\[ \alpha = \sum_{m \ge (1 + c)N/2} \binom{N}{m} 2^{-N} \le e^{-c^2 N/2}. \]

Also, $E\,\tau \ge 1$. We then have from (7) that

\[ E\,T_I = \frac{E\,\tau}{\alpha} \ge e^{c^2 N/2}, \qquad M \ge \tfrac{1}{2}(1 + c)N, \tag{19} \]

the bound holding for $M \ge \frac{1}{2}(1 + c)N$ again by the monotonicity of $E\,T_I$.

The lower bound on D(N, M) in part (a) of the theorem follows from Theorem 3.1, while the
upper bound follows from (17) and the upper bound of Theorem 3.5. The lower bound for D(N, M)
in part (b) of the theorem again follows from Theorem 3.1, while the upper bound follows from (18)
and the upper bound of Theorem 3.5. Part (c) of the theorem follows from (19) and the lower bound
of Theorem 3.5. ∎
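The threshold behaviour around M = N/2 is easy to see numerically for the Indolent Strategy. The sketch below evaluates its expected duration exactly from (12) and (13), namely $2M + 2\sum_{m<M} m\binom{N}{m} / (2^N - \sum_{m<M}\binom{N}{m})$, and compares it with the Hoeffding-type bounds, assuming the exponent $c^2 N/2$ referred to in the remarks. N = 50 is chosen to match Fig. 6; the function name is illustrative.

```python
from fractions import Fraction
from math import comb, exp

def indolent_duration(N, M):
    """Exact expected duration of the Indolent Strategy from (12) and (13)."""
    num = 2 * sum(m * comb(N, m) for m in range(M))
    den = 2 ** N - sum(comb(N, m) for m in range(M))
    return 2 * M + Fraction(num, den)

N = 50
low = float(indolent_duration(N, 10))    # M = (1 - c) N / 2 with c = 0.6
mid = float(indolent_duration(N, 25))    # M = N / 2
high = float(indolent_duration(N, 40))   # M = (1 + c) N / 2 with c = 0.6
bound = exp(-0.6 ** 2 * N / 2)           # e^{-c^2 N / 2} = e^{-9}
```

With these values `low` exceeds 2M = 20 only negligibly, `mid` lies between N and 2N, and `high` is larger than $e^{9}$ by several orders of magnitude.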


* We ignore, for the sake of notational economy, the fairly transparent details with regard to rounding to the nearest
integer.
When $M \le N/2$, the expected duration for the Indolent Strategy is (from (17) and (18))
within an additive factor of N of the minimum expected duration of the game. When $M \ge \frac{1}{2}(1 + c)N$,
the probability of achieving M successes in a round is exponentially small: $\alpha = O(e^{-c^2 N/2})$.
The expected duration of a round, on the other hand, increases linearly at best: $1 \le E\,\tau \le N$.
From (7) and the above proof, it hence follows that the expected duration of the game is dictated
predominantly by the factor $1/\alpha$ in this regime. Note that the largest value for $\alpha$ obtains for the
Indolent Strategy, so that we obtain the sharpest exponent with $E\,T_I \ge e^{c^2 N/2}$ for this strategy.
An optimal strategy can save at best in the $E\,\tau$ term (bounded between 1 and N), but cannot obtain
better exponents.

Explicit coefficients can also be obtained for the other strategies we have considered by using (10)
and (12). For instance, consider the strategy Hope Springs Eternal. From (14) we can write

\[ p_m = 2^{-n_0} \left( \frac{n_0 + m - 1}{2m} \right) \left( \frac{n_0 + m - 2}{2(m - 1)} \right) \cdots \left( \frac{n_0}{2} \right). \]

Each product term is of the form

\[ \frac{n_0 + m - k - 1}{2(m - k)} = \frac{(n_0 - 1) + (m - k)}{(m - k) + (m - k)}. \]

If $M \le \frac{1}{2}(1 - c)N$, then $N - M \ge m$, $0 \le m \le M - 1$, so that each of the product terms is larger
than 1. (Recall $n_0 = N - M + 1$.) Hence $p_m$ is monotone increasing. Hence

E(5r;(r,5,)€/2]  < 


where  c\  is  a  positive  constant  which  can  be  expressed  in  terms  of  the  binary  entropy  function. 
Further,  o  =  1  —  (9(e“‘’^)  for  a  positive  constant  cj.  Thus,  ET//  =  2Af[l  + 0(e“'‘^)],  as  expected. 
Similarly,  using  (15)  for  Step  in  Time,  we  can  readily  obtain  ETs  =  2Af  [1+  C?(c”'’^)]  for  a  positive 
constant  C3  when  M  <  1).  Bachelis  and  Massey  (3]  have  done  a  careful  asymptotic  analysis 

of  the  strategies  Hope  Springs  Eternal  and  Step  in  Time  for  a  choice  of  M  =  N/2  +  0{\/N). 

Similar estimates can also be readily derived for the Ballot Strategy using (16). In particular, the
summands in the sum $\sum_{m=0}^{M-1} m\, p_m$ are of the form $\binom{mK}{m-1} 2^{-(mK+1)}$, which are $\Theta(m^{-1/2})$. If, for instance,
M = N/2, then $\sum_{m=0}^{M-1} m\, p_m$ evaluates to $\Theta(\sqrt{N})$. Similarly, the summands in $\beta = \sum_{m=0}^{M-1} p_m$ are
of the form $p_m = \frac{1}{m}\binom{mK}{m-1} 2^{-(mK+1)} = \Theta(m^{-3/2})$. For M = N/2 then, $\beta = 1 - \Theta(N^{-1/2})$, so
that $\alpha = \Theta(N^{-1/2})$ in this regime. It follows that $E\,T_B = \Theta(N)$ when $M \le N/2$. Of course, a more
careful analysis of these expressions is needed if exact coefficients are required.


A‘o  +  Af  -  2'\ 

V  np  -  1  ) 


4  OPTIMAL  STOPPING 

The renewal property of our coin tossing game allows us to consider a somewhat simpler optimal
stopping problem in order to determine the boundary of an optimal strategy. Recall that our basic
problem is to determine a nonrandom Boolean function f* such that (f*) is an optimal restart
strategy. A heuristic approach towards specifying f* is as follows: assign a current cost of n if the
round is continued after n tosses, and assign a higher restart cost of d + n if there is a restart. The
idea then is to choose f* to minimise the expected cost. (Note that a restart is mandated if N tosses
result in fewer than M heads; thus f* will favour continuing the round for small values of n, but
will favour restarts when n becomes large and the number of heads is less than M.) This related
problem is an optimal stopping problem, and as we will see shortly, this will be equivalent to our
original coin tossing problem for an appropriate choice of parameter d. Bachelis and Massey [3] also
consider a similar approach, though the algorithm given here is somewhat more direct, iteratively
using the estimates (12) obtained in Theorem 3.3.

Consider a sequence of n tosses of a fair coin stopping on or before the Nth toss. Let d > 0
be some fixed positive real number. Stopping the sequence of tosses at trial $n \in \{0, \ldots, N\}$ results
in an allocation of a cost as follows: if the number of heads is less than M the cost is (d + n); if
the number of heads is greater than or equal to M the cost is n. The optimal stopping problem
is to decide on a (randomised) stopping rule $f : \{0, \ldots, N\} \times \{0, \ldots, M - 1\} \to \{0, 1\}$ which will
minimise the cost. More formally, let f denote a stopping rule for the optimal stopping problem,
and the random variable $R_f$ denote the cost assigned under f. The optimal stopping problem is
now to find an optimal stopping rule f* such that

\[ E\,R_{f^*} = \inf_f E\,R_f. \tag{20} \]

The technique of backward induction (see, for instance, Chow, Robbins, and Siegmund [8])
can be applied to this optimal stopping problem to generate a nonrandom optimal stopping rule
(recall Lemma 2.1). Informally, the procedure asserts that, after the nth toss, it is worth continuing
the sequence of coin tosses if the conditional expected cost given the results of the first n tosses is
smaller than the cost of stopping at the nth toss. More precisely, for the optimal stopping rule, we
recursively obtain the conditional expected cost $\gamma(n, m)$ corresponding to n tosses with m successes
as follows:

\[ \begin{aligned}
\text{BASE:} \quad & \gamma(n, M) = n, && 0 \le n \le N, \\
& \gamma(N, m) = d + N, && 0 \le m \le M - 1, \\
\text{RECURSION:} \quad & \gamma(n, m) = \min\left\{ d + n,\ \tfrac{1}{2}\left[ \gamma(n + 1, m) + \gamma(n + 1, m + 1) \right] \right\}.
\end{aligned} \tag{21} \]

The nonrandom optimal stopping rule $f^*_d$ is now determined as follows:

\[ f^*_d(n, m) = \begin{cases} 1 & \text{if } \gamma(n, m) = d + n, \\ 0 & \text{otherwise.} \end{cases} \tag{22} \]

We define the optimal stopping boundary by

\[ \Gamma^*_d(n) = \max\{m : \gamma(n, m) = d + n\}. \tag{23} \]
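The backward induction (21)-(23) can be sketched in a few lines. The code below is an illustration under stated assumptions: it uses exact rational arithmetic so that the equality test in (22) is unambiguous, resolves ties in favour of stopping (as the equality in (22) suggests), and reports a boundary value of -1 at any n where no state stops. The function name is hypothetical.

```python
from fractions import Fraction

def backward_induction(N, M, d):
    """Return (gamma, rule, boundary) for restart cost d, following (21)-(23)."""
    d = Fraction(d)
    # BASE: gamma(n, M) = n; the other columns are overwritten below.
    gamma = [[Fraction(n)] * (M + 1) for n in range(N + 1)]
    for m in range(M):
        gamma[N][m] = d + N                      # BASE: forced stop after N tosses
    for n in range(N - 1, -1, -1):               # RECURSION, backwards in n
        for m in range(M - 1, -1, -1):
            cont = (gamma[n + 1][m] + gamma[n + 1][m + 1]) / 2
            gamma[n][m] = min(d + n, cont)
    # (22): stop exactly where the stopping cost attains the minimum.
    rule = [[int(gamma[n][m] == d + n) for m in range(M)] for n in range(N + 1)]
    # (23): largest m at which the rule stops, or -1 if it never stops at this n.
    boundary = [max((m for m in range(M) if rule[n][m]), default=-1)
                for n in range(N + 1)]
    return gamma, rule, boundary
```

For N = 2, M = 1 and d = 10 the rule never stops before the second toss, and the value at the origin is 4, which matches $\beta d + E\,\tau = 10/4 + 3/2$ for that rule.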

Lemma 4.1 The optimal stopping rule $f^*_d$ is consistent for any choice of d > 0.

Proof: The proof is by backward induction on n. The base of the recursion for the expected cost
gives

\[ f^*_d(n, M) = 0, \quad 0 \le n \le N, \qquad f^*_d(N, m) = 1, \quad 0 \le m \le M - 1. \]

Now by definition, $f^*_d(n, \Gamma^*_d(n)) = 1$. We take as inductive hypothesis that $f^*_d(k, l) = 1$ for all
$k \ge n$ and $l \le \Gamma^*_d(n)$. It is now easy to see from (21) that $f^*_d(n - 1, m) = 0$ if $m \ge \Gamma^*_d(n) + 1$ and
$f^*_d(n - 1, m) = 1$ if $m \le \Gamma^*_d(n) - 1$. Thus $\Gamma^*_d(n - 1)$ is either $\Gamma^*_d(n)$ or $\Gamma^*_d(n) - 1$. In either case it
follows that $f^*_d(k, l) = 1$ for all $k \ge n - 1$ and $l \le \Gamma^*_d(n - 1)$. ∎

The  optimal  stopping  problem  described  above  and  the  coin  tossing  game  of  Section  2  are  closely 
related  as  shown  by  the  following  result. 

Lemma 4.2 If $d = D(N, M) = \inf_{(f) \in S} E\,T_f$, then an optimal restart strategy (f*) for the coin
tossing problem (3) determines an optimal stopping rule f* for the optimal stopping problem (20),
and conversely.

Remarks: Lemmas 4.1 and 4.2 together complete the proof of Lemma 2.2 so that by Theorem 2.5
it follows that there exists a parsimonious optimal restart strategy. Lemma 4.2 and the proof of
Lemma 4.1 hence guarantee that with d = D(N, M), the optimal stopping rule $f^*_d$ determined
by (22) is parsimonious.

Proof: The optimal stopping problem (20) minimises

\[ E\,R_f = \beta_f \left[ d + E(\tau_f \mid S_{\tau_f} < M) \right] + \alpha_f\, E(\tau_f \mid S_{\tau_f} = M) = \beta_f d + E\,\tau_f. \]

Now select d = D(N, M) and let (f*) be a stationary, nonrandom optimal strategy for the coin
tossing problem (3). It follows from (7) that $D(N, M) = E\,T_{f^*} = E(\tau_{f^*})/\alpha_{f^*}$. Let $R^* = \inf_f E\,R_f$.
We claim that $R^* = E\,R_{f^*} = E\,T_{f^*}$. Clearly, $R^* \le E\,R_{f^*}$. Observe now that with the stopping rule
f*, we have

\[ E\,R_{f^*} = \frac{\beta_{f^*}}{\alpha_{f^*}} E(\tau_{f^*}) + E(\tau_{f^*}) = \frac{E(\tau_{f^*})}{\alpha_{f^*}} = E\,T_{f^*}. \]

Thus, for any f,

\[ E\,R_f - E\,R_{f^*} = \frac{\beta_f}{\alpha_{f^*}} E(\tau_{f^*}) + E(\tau_f) - \frac{E(\tau_{f^*})}{\alpha_{f^*}}
   = \alpha_f \left[ \frac{E(\tau_f)}{\alpha_f} - \frac{E(\tau_{f^*})}{\alpha_{f^*}} \right] \ge 0. \]

It follows that $R^* \ge E\,R_{f^*}$, and hence $R^* = E\,R_{f^*} = E\,T_{f^*}$. Thus, if (f*) is an optimal restart
strategy for (3) then f* is an optimal stopping rule for (20).

To complete the proof, suppose that with $d = D(N, M) = E\,T_{f^*}$, f is an optimal stopping rule
for (20): $R^* = E\,R_f$. We need to show that (f) is also an optimal restart strategy for (3). By (7)
it suffices to show that $E(\tau_f)/\alpha_f = E(\tau_{f^*})/\alpha_{f^*}$. Now by assumption of optimality of the stopping
rule f it follows that

\[ R^* = E\,R_f = \frac{\beta_f}{\alpha_{f^*}} E(\tau_{f^*}) + E(\tau_f). \]

But $R^* = E\,T_{f^*} = E(\tau_{f^*})/\alpha_{f^*}$. It follows that $E(\tau_f) = \frac{\alpha_f}{\alpha_{f^*}} E(\tau_{f^*})$. ∎


The backward induction (21)-(23) can now be used to iteratively compute the boundary of a
parsimonious optimal restart strategy (f*). The approach followed here can also be derived from a
nested Markov decision problem approach (cf. Ross [4]).

Algorithm I (Backward Induction) Given a number of coins N, and the desired number of
successes M, this algorithm obtains the boundary of an optimal restart strategy.

I1. [Initial approximation.] Let $(f_0)$ be any initial strategy, and set $d_0 = E\,T_{f_0}$. Set $j \leftarrow 0$.

I2. [Optimal stopping.] Set $d \leftarrow d_j$ and solve the associated optimal stopping problem using the
backward induction (21)-(23) to obtain the optimal stopping boundary $\Gamma^*_{d_j}$ (corresponding to
the stopping rule $f^*_{d_j}$).

I3. [Check for convergence.] If $n_0 = \min\{n : f^*_{d_j}(n, 0) = 1\} = 0$, then output the (optimal) boundary
$\Gamma^*_{d_{j-1}}$ and stop.

I4. [Estimate expected duration.] Set $j \leftarrow j + 1$. Using (9)-(12) set $d_j \leftarrow E\,T_{f^*_{d_{j-1}}}(N, M)$.

I5. [Iteration.] If $d_j \ne d_{j-1}$, go back to step I2; otherwise, if $d_j = d_{j-1}$, output the (optimal)
boundary and terminate the algorithm.
Remarks: At epoch j, the cost $d_j$ for the optimal stopping problem is exactly the expected
duration corresponding to the optimal boundary $\Gamma^*_{d_{j-1}}$ obtained for the cost $d_{j-1}$. If now we obtain
$n_0 = 0$ for the new optimal boundary $\Gamma^*_{d_j}$, then this implies the game cannot be started, which in
turn implies a cost of $d_j$ for the optimal stopping problem. This, however, is the cost associated
with the previous optimal boundary $\Gamma^*_{d_{j-1}}$. Thus, encountering $n_0 = 0$ at epoch j in the progress of
the algorithm implies that the expected risk $d_j$ is a fixed point of the algorithm, and this expected
risk is exactly the expected duration of the coin tossing game for a choice of boundary $\Gamma^*_{d_{j-1}}$.

Any convenient value for $d_0$ can be chosen, though the Ballot Strategy yields a particularly
good starting point as we will see in the numerical simulations of Section 5.
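The iteration of Algorithm I can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it starts from the Indolent Strategy, uses exact rational arithmetic so that the convergence tests are exact, and evaluates the expected duration of each stopping rule by propagating one round directly and applying $E\,T_f = E(\tau)/\alpha$ from (7), instead of going through (9)-(12). All function names are hypothetical.

```python
from fractions import Fraction

def backward_induction(N, M, d):
    """Stopping rule f[n][m] from (21)-(22) for restart cost d (ties stop)."""
    gamma = [[Fraction(n)] * (M + 1) for n in range(N + 1)]
    for m in range(M):
        gamma[N][m] = d + N
    for n in range(N - 1, -1, -1):
        for m in range(M - 1, -1, -1):
            cont = (gamma[n + 1][m] + gamma[n + 1][m + 1]) / 2
            gamma[n][m] = min(d + n, cont)
    return [[int(gamma[n][m] == d + n) for m in range(M)] for n in range(N + 1)]

def expected_duration(f, N, M):
    """E T_f = E(tau)/alpha for a restart rule f with f[N][m] = 1 for all m."""
    alive = {0: Fraction(1)}
    e_tau = Fraction(0)
    alpha = Fraction(0)
    for t in range(1, N + 1):
        nxt = {}
        for h, pr in alive.items():
            for h2 in (h, h + 1):
                q = pr / 2
                if h2 == M:                 # success
                    alpha += q
                    e_tau += t * q
                elif f[t][h2]:              # restart
                    e_tau += t * q
                else:
                    nxt[h2] = nxt.get(h2, Fraction(0)) + q
        alive = nxt
    return e_tau / alpha

def algorithm_I(N, M):
    """Return (d, f): the fixed-point duration and the rule attaining it (needs M <= N)."""
    f = [[int(n == N)] * M for n in range(N + 1)]   # step I1: Indolent Strategy
    d = expected_duration(f, N, M)
    while True:
        g = backward_induction(N, M, d)             # step I2
        if g[0][0]:                                 # step I3: n_0 = 0, fixed point
            return d, f
        d_new = expected_duration(g, N, M)          # step I4
        if d_new == d:                              # step I5
            return d, g
        d, f = d_new, g
```

For N = 3 and M = 2 the iteration moves from the Indolent value 11/2 to 14/3 in one step and then stops; the returned rule restarts after an initial tail. For M = 1 every strategy has expected duration 2, and the sketch returns immediately.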


Theorem 4.3 Algorithm I converges to a parsimonious optimal restart strategy (f*) for the coin
tossing problem (3).

Proof: We first show that the algorithm converges. By step I3 of the algorithm, this clearly
happens if $n_0 = 0$ at any point. Assume now that $n_0 \ne 0$ at any point in the progress of the
algorithm. Let $d_j$ denote the value of d at the jth iteration (see step I2 of Algorithm I), and let
$f^*_{d_j}$ be the solution of the optimal stopping problem with $d = d_j$. Then $d_j = E(\tau_{f^*_{d_{j-1}}})/\alpha_{f^*_{d_{j-1}}}$. We
hence have

\[ \beta_{f^*_{d_j}} d_j + E(\tau_{f^*_{d_j}}) = E\,R_{f^*_{d_j}} \le E\,R_{f^*_{d_{j-1}}} = \beta_{f^*_{d_{j-1}}} d_j + E(\tau_{f^*_{d_{j-1}}}) = d_j. \]

It follows that $d_j \ge d_{j+1} = E(\tau_{f^*_{d_j}})/\alpha_{f^*_{d_j}}$. Thus, the sequence $\{d_j\}$ is decreasing, and as there are
only finitely many possible values for d, the algorithm converges.

Now let f* be any fixed point of the algorithm. Applying (6) inductively we obtain that for
any r, and any choices of restart functions $f_i$, $1 \le i \le r$, $(f^*) \le (f_1, \ldots, f_r, f^*, f^*, \ldots)$. Allowing
$r \to \infty$ we see that (f*) is an optimal strategy for the game (3). ∎


5  NUMERICAL  SIMULATIONS 

The  Ballot  Strategy  was  observed  to  have  performances  comparable  to  the  optimal  strategy  on 
simulations  over  several  values  of  n.  In  Fig.  6  we  contrast  the  expected  duration  of  the  game  for 
the  Indolent,  Hope  Springs  Eternal,  and  Ballot  Strategies  with  the  minimum  expected  duration  of 
the  game  for  an  optimal  strategy  generated  by  Algorithm  I  using  the  Indolent  Strategy  as  an  initial 
strategy.  (Note  that  rather  large  absolute  differences  in  expected  duration  across  the  strategies 
are  hidden  because  of  the  logarithmic  scale  of  the  plots.)  Note,  that  as  per  parts  (a)  and  (b)  of 
Theorem  3.3,  all  the  strategies  are  essentially  equivalent  when  M  <  N/2,  and  that  it  is  only  in  the 
regime  M  >  N/2  of  expected  exponential  duration  that  it  pays  to  look  for  an  optimal  strategy. 

Numerically,  the  strategy  Step  in  Time  was  found  to  give  results  comparable  to  Hope  Springs 
Eternal  for  large  values  of  L,  and  substantially  poorer  results  for  small  values  of  L.  We  did  not 
attempt  to  optimise  the  value  of  L  in  light  of  the  performance  of  the  Ballot  Strategy.  As  an  aside, 
the probabilities p_m given by recursion (10) need to be evaluated with some care (especially for large
values of M) as the backward induction can be numerically sensitive.

In Fig. 7 we plot the restart boundary of the Ballot Strategy compared with that
of  an  optimal  strategy.  Note  the  close  correspondence  of  the  boundaries  of  the  two  strategies.  It 
would  appear  that  the  Ballot  Strategy  is  slightly  more  conservative  in  setting  the  restart  boundary 
than  an  optimal  strategy. 

6  EXTENSIONS 

The results of this paper can be easily generalised to the case where biased coins are tossed. Let
P{X_i = 1} = η, P{X_i = 0} = 1 − η = ν. The following generalisation of Lemma 3.2 follows easily:

Figure 6: The expected duration of the coin tossing game is plotted (on a logarithmic scale) versus
the desired number of successes M for N = 50 for the Indolent, Hope Springs Eternal, Ballot, and
Optimal Strategies.


Lemma 6.1 The probabilities p_m, 0 ≤ m ≤ M − 1, satisfy the following recursion:

BASE:  po  = 

recursion:  Pm  =  p.  ^  >  j 

A  straightforward  generalisation  of  Theorem  3.3  now  yields: 

Theorem 6.2 Assume η > 0. Then, for any consistent strategy f,

ETj(N,M)  =  ^  +  - 

It  is  also  easily  verified  that  Theorem  3.5  holds  unchanged.  A  more  general  form  of  Theorem  3.6  is 
now  readily  obtained. 

Theorem 6.3 Let ε > 0 be any fixed parameter and N any positive integer. Then:

(a)  IfM<{\-  e)r,N,  then  f  <  D{N,M)  <  f  [l  + 

(b)  If  M  >  (I  +  e)rjN ,  then  D{N,  M)  >  jj  . 

Note  that  we  can  again  use  Chernoff’s  bounds  to  obtain  tighter  results. 

Corresponding to the change in the recursion for p_m, the backward induction (21) changes
accordingly, with the recursion replaced by

$\gamma(n, m) = \min\{(d + n),\; \eta\,\gamma(n + 1, m + 1) + \nu\,\gamma(n + 1, m)\}$.

Figure 7: The boundaries of the Ballot Strategy and an optimal strategy for N = 50 and choices of
M = 10, 25, and 40.


Algorithm  I  continues  to  work  as  before. 
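
The modified backward induction is straightforward to sketch numerically. The following Python fragment is a minimal illustration, not the authors' implementation: for a fixed restart cost d it tabulates the value of the stopping problem by comparing the cost d + n of restarting after n tosses with the expected cost of tossing once more. The function name `restart_value`, the parameter `p_head` (the probability of a head), and both boundary conditions (success costs the n tosses used; an exhausted store of N coins forces a restart at cost d + N) are assumptions made for the illustration.

```python
def restart_value(N, M, d, p_head):
    """Value at (n, m) = (0, 0) of the stopping problem for a fixed restart cost d.

    N: number of coins in the store, M: heads required,
    d: cost charged on a restart, p_head: probability of a head (assumed name).
    """
    p_tail = 1.0 - p_head

    # nxt[m] holds the value at (n + 1, m). Start from n = N, where the store
    # is exhausted: success costs N tosses, otherwise a restart is forced
    # (both boundary conditions are assumptions of this sketch).
    nxt = [float(N) if m >= M else d + N for m in range(M + 1)]

    for n in range(N - 1, -1, -1):
        cur = [0.0] * (M + 1)
        for m in range(M + 1):
            if m >= M:
                # Target reached: only the n tosses already made are paid for.
                cur[m] = float(n)
            else:
                stop = d + n  # restart now
                go = p_head * nxt[m + 1] + p_tail * nxt[m]  # toss once more
                cur[m] = min(stop, go)
        nxt = cur

    return nxt[0]
```

With a single fair coin (N = M = 1) and d = 10, the call `restart_value(1, 1, 10.0, 0.5)` returns 6.0: tossing costs 0.5 · 1 + 0.5 · (10 + 1) = 6, which beats restarting at cost 10.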

The  problem  in  this  general  form  with  biased  coins  would  seem  to  have  more  direct  relevance 
to  early  abort  strategies  in  randomised  algorithms  for  factoring  integers.  For  instance,  going  back 
to  the  discussion  in  the  Introduction,  consider  the  problem  of  finding  a  factor  of  a  (large)  integer  n. 
Assume that we have selected an integer m of the order of ln n for Dixon’s algorithm. As we saw,
a rough approximation to the problem in terms of the coin tossing game is to consider a store of
N = π(m) unfair coins with identical probabilities 1/m of a toss resulting in a head, and to seek
an optimal strategy which minimises the total number of tosses before achieving M = ln n / ln m
heads. The classical estimate for the number of primes less than m,

$\pi(m) \sim m / \ln m$,

shows that we are in the exponential domain for D(N, M), and a ready calculation using Theorem 6.3(b)
yields

$D(N, M) = e^{\kappa \ln n / \ln\ln n}$

for a positive constant $\kappa$. (Alternatively, the bound can be made exponentially tight by using the sharper binomial tail
bounds.) This would suggest the rough estimate $n^{O(1/\ln\ln n)}$ for the minimum (over all early abort
strategies) expected number of steps in Dixon’s algorithm before one solution to (2) is obtained.
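
The regime claimed above can be checked numerically for concrete values. The Python sketch below is illustrative only: it counts primes with a simple sieve, forms N = π(m) and M = ln n / ln m, and compares M with the expected number of heads N/m in one pass through the store. The function names `prime_count` and `dixon_game_parameters` are hypothetical, and π(m) is taken here to count primes up to and including m.

```python
import math


def prime_count(m):
    """Number of primes <= m, computed with a sieve of Eratosthenes."""
    if m < 2:
        return 0
    sieve = bytearray([1]) * (m + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(m) + 1):
        if sieve[i]:
            # Cross out the multiples of i starting from i * i.
            sieve[i * i::i] = bytearray(len(range(i * i, m + 1, i)))
    return sum(sieve)


def dixon_game_parameters(n, m):
    """Coin tossing parameters for the rough model of Dixon's algorithm.

    Returns (N, M, mean_heads): the number of coins N = pi(m), the number of
    heads required M = ln n / ln m, and the expected number of heads N / m
    obtained by tossing every coin once with head probability 1/m.
    """
    N = prime_count(m)
    M = math.log(n) / math.log(m)
    mean_heads = N / m
    return N, M, mean_heads
```

For n = 10^40 and m = 92 (roughly ln n), this gives N = 24 coins and a target of M ≈ 20.4 heads against an expected N/m ≈ 0.26 heads per pass, which is well inside the regime of Theorem 6.3(b).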